Abstract

Convex potential minimisation is the de facto approach to binary classification. However, Long and 
Servedio [2010] proved that under symmetric label noise (SLN), minimisation of any convex potential 
over a linear function class can result in classification performance equivalent to random guessing. This 
ostensibly shows that convex losses are not SLN-robust. In this paper, we propose a convex,
classification-calibrated loss and prove that it is SLN-robust. The loss avoids the Long and Servedio [2010]
result by virtue of being negatively unbounded. The loss is a modification of the hinge loss, where one does
not clamp at zero; hence, we call it the unhinged loss. We show that the optimal unhinged solution is
equivalent to that of a strongly regularised SVM, and is the limiting solution for any convex potential; this
implies that strong $\ell_2$ regularisation makes most standard learners SLN-robust. Experiments confirm the SLN-robustness of
the unhinged loss. 


1 Learning with symmetric label noise 

Binary classification is the canonical supervised learning problem. Given an instance space $\mathcal{X}$, and samples
from some distribution $D$ over $\mathcal{X} \times \{\pm 1\}$, the goal is to learn a scorer $s \colon \mathcal{X} \to \mathbb{R}$ with low misclassification
error on future samples drawn from $D$. Our interest is in the more realistic scenario where the learner
observes samples from a distribution $\bar{D}$, which is a corruption of $D$ where labels have some constant probability
of being flipped. The goal is still to perform well with respect to the unobserved distribution $D$. This is
known as the problem of learning from symmetric label noise (SLN learning) [Angluin and Laird, 1988].

Long and Servedio [2010] proved the following negative result on what is possible in SLN learning: there
exists a linearly separable $D$ where, when the learner observes some corruption $\bar{D}$ with symmetric label noise
of any nonzero rate, minimisation of any convex potential over a linear function class results in classification
performance on $D$ that is equivalent to random guessing. Ostensibly, this establishes that convex losses are
not “SLN-robust” and motivates the use of non-convex losses [Stempfel and Ralaivola, 2009, Masnadi-Shirazi 
et al., 2010, Ding and Vishwanathan, 2010, Denchev et al., 2012, Manwani and Sastry, 2013]. 

In this paper, we propose a convex loss and prove that it is SLN-robust. The loss avoids the result of Long 
and Servedio [2010] by virtue of being negatively unbounded. The loss is a modification of the hinge loss 
where one does not clamp at zero; thus, we call it the unhinged loss. We show that this is the unique convex 
loss (up to scaling and translation) that satisfies a notion of “strong SLN-robustness” (Proposition 4). In
addition to being SLN-robust, this loss has several attractive properties, such as being classification-calibrated 
(Proposition 5), consistent when minimised on the corrupted distribution (Proposition 6), and having an easily 
computable optimal solution that is the difference of two kernel means (Equation 9). Finally, we show that 
this optimal solution is equivalent to that of a strongly regularised SVM (Proposition 7), and such a result 
holds more generally for any twice-differentiable convex potential (Proposition 8), implying that strong $\ell_2$
regularisation endows most standard learners with SLN-robustness.
The classifier resulting from minimising the unhinged loss is not new [Devroye et al., 1996, Chapter 10],
[Schölkopf and Smola, 2002, Section 1.2], [Shawe-Taylor and Cristianini, 2004, Section 5.1]. However,
establishing this classifier's SLN-robustness and its equivalence to a highly regularised SVM solution, and showing
that the underlying loss uniquely satisfies a notion of strong SLN-robustness, are to our knowledge novel.


2 Background and problem setup 

Fix an instance space $\mathcal{X}$. We denote by $D$ some distribution over $\mathcal{X} \times \{\pm 1\}$, with random variables $(X, Y) \sim D$.
Any $D$ may be expressed via the class-conditional distributions $(P, Q) = (P(X \mid Y = 1), P(X \mid Y = -1))$
and base rate $\pi = P(Y = 1)$, or equivalently via the marginal distribution $M = P(X)$ and class-probability
function $\eta \colon x \mapsto P(Y = 1 \mid X = x)$. We interchangeably write $D$ as $D_{P,Q,\pi}$ or $D_{M,\eta}$.

2.1 Classifiers, scorers, and risks 

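A scorer is any function $s \colon \mathcal{X} \to \mathbb{R}$. A loss is any function $\ell \colon \{\pm 1\} \times \mathbb{R} \to \mathbb{R}$. We use $\ell_{-1}, \ell_1$ to refer to
$\ell(-1, \cdot)$ and $\ell(1, \cdot)$. The $\ell$-conditional risk $L_\ell \colon [0, 1] \times \mathbb{R} \to \mathbb{R}$ is defined as
$L_\ell \colon (\eta, v) \mapsto \eta \cdot \ell_1(v) + (1 - \eta) \cdot \ell_{-1}(v)$. Given a distribution $D$, the $\ell$-risk of a scorer $s$ is defined as

$$L_\ell^{D}(s) = \mathbb{E}_{(X,Y) \sim D} [\ell(Y, s(X))], \qquad (1)$$

or equivalently $L_\ell^{D}(s) = \mathbb{E}_{X \sim M} [L_\ell(\eta(X), s(X))]$. For a set $\mathcal{S}$, $L_\ell^{D}(\mathcal{S})$ is the set of $\ell$-risks for all scorers in $\mathcal{S}$.

A function class is any $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$. Given some $\mathcal{F}$, the set of restricted Bayes-optimal scorers for a loss $\ell$
are those scorers in $\mathcal{F}$ that minimise the $\ell$-risk:

$$\mathcal{S}_\ell^{D,\mathcal{F},*} = \operatorname{Argmin}_{s \in \mathcal{F}} L_\ell^{D}(s).$$

The set of (unrestricted) Bayes-optimal scorers is $\mathcal{S}_\ell^{D,*} = \mathcal{S}_\ell^{D,\mathcal{F},*}$ for $\mathcal{F} = \mathbb{R}^{\mathcal{X}}$. The restricted $\ell$-regret of a
scorer is its excess risk over that of any restricted Bayes-optimal scorer:

$$\mathrm{regret}_\ell^{D,\mathcal{F}}(s) = L_\ell^{D}(s) - \inf_{t \in \mathcal{F}} L_\ell^{D}(t).$$

Binary classification is concerned with the risk corresponding to the zero-one loss,
$\ell^{01} \colon (y, v) \mapsto [\![ yv < 0 ]\!] + \frac{1}{2} [\![ v = 0 ]\!]$. A loss $\ell$ is classification-calibrated if all its Bayes-optimal scorers are also optimal for
zero-one loss: $(\forall D) \; \mathcal{S}_\ell^{D,*} \subseteq \mathcal{S}_{01}^{D,*}$. A convex potential is any loss $\ell \colon (y, v) \mapsto \phi(yv)$, where $\phi \colon \mathbb{R} \to \mathbb{R}_+$ is convex,
non-increasing, differentiable with $\phi'(0) < 0$, and $\phi(+\infty) = 0$ [Long and Servedio, 2010, Definition 1]. All
convex potential losses are classification-calibrated [Bartlett et al., 2006, Theorem 2.1].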

2.2 Learning with symmetric label noise (SLN learning) 

The problem of learning with symmetric label noise (SLN learning) is the following [Angluin and Laird, 1988,
Kearns, 1998, Blum and Mitchell, 1998, Natarajan et al., 2013]. For some notional “clean” distribution $D$,
which we would like to observe, we instead observe samples from some corrupted distribution $\mathrm{SLN}(D, \rho)$,
for some $\rho \in [0, 1/2)$. The distribution $\mathrm{SLN}(D, \rho)$ is such that the marginal distribution of instances is
unchanged, but each label is independently flipped with probability $\rho$. The goal is to learn a scorer from these
corrupted samples such that $L_{01}^{D}(s)$ is small.

For any quantity in $D$, we denote its corrupted counterpart in $\mathrm{SLN}(D, \rho)$ with a bar, e.g. $\bar{M}$ for the
corrupted marginal distribution, and $\bar{\eta}$ for the corrupted class-probability function; additionally, when $\rho$ is
clear from context, we will occasionally refer to $\mathrm{SLN}(D, \rho)$ by $\bar{D}$. By definition of the corruption process,
the corrupted marginal distribution $\bar{M} = M$, and [Natarajan et al., 2013, Lemma 7]

$$(\forall x \in \mathcal{X}) \; \bar{\eta}(x) = (1 - 2\rho) \cdot \eta(x) + \rho. \qquad (2)$$
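The relation between the clean and corrupted class-probability functions is easy to check by simulation. The sketch below is illustrative only: the helper names `corrupt_labels` and `corrupted_eta`, the constant class-probability `eta = 0.9`, the noise rate `rho = 0.2`, and the sample size are assumptions chosen for the example.

```python
import numpy as np

def corrupt_labels(y, rho, rng):
    """Flip each label in y (values in {-1, +1}) independently with probability rho."""
    flips = rng.random(y.shape[0]) < rho
    return np.where(flips, -y, y)

def corrupted_eta(eta, rho):
    """Corrupted class-probability under symmetric label noise: (1 - 2*rho) * eta + rho."""
    return (1.0 - 2.0 * rho) * eta + rho

rng = np.random.default_rng(0)
eta, rho, n = 0.9, 0.2, 200_000

# Clean labels at a single instance x with P(Y = 1 | X = x) = eta.
y = np.where(rng.random(n) < eta, 1, -1)
y_bar = corrupt_labels(y, rho, rng)

empirical = np.mean(y_bar == 1)       # observed fraction of positive noisy labels
predicted = corrupted_eta(eta, rho)   # value given by the formula
print(empirical, predicted)
```

The empirical frequency of positive noisy labels should agree with the formula up to sampling error.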
3 SLN-robustness: formalisation


For our purposes, a learner $(\ell, \mathcal{F})$ comprises a loss $\ell$, and a function class $\mathcal{F}$, with learning being the search
for some $s \in \mathcal{F}$ that minimises the $\ell$-risk. Informally, the learner $(\ell, \mathcal{F})$ is “robust” to symmetric label noise
(SLN-robust) if minimising $\ell$ over $\mathcal{F}$ gives the same classifier on both the clean distribution $D$, which the
learner would like to observe, and $\mathrm{SLN}(D, \rho)$ for any $\rho \in [0, 1/2)$, which the learner actually observes. We
now formalise this notion, and review what is known about the existence of SLN-robust learners.


3.1 SLN-robust learners: a formal definition 


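For some fixed instance space $\mathcal{X}$, let $\Delta$ denote the set of distributions on $\mathcal{X} \times \{\pm 1\}$. Given a notional “clean”
distribution $D$, $\mathcal{N}_{\mathrm{sln}} \colon \Delta \to 2^{\Delta}$ returns the set of possible corrupted versions of $D$ the learner may observe,
where labels are flipped with unknown probability $\rho$:

$$\mathcal{N}_{\mathrm{sln}} \colon D \mapsto \{ \mathrm{SLN}(D, \rho) \mid \rho \in [0, 1/2) \}.$$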



Equipped with this, we define our notion of SLN-robustness. 

Definition 1 (SLN-robustness). We say that a learner $(\ell, \mathcal{F})$ is SLN-robust if

$$(\forall D \in \Delta) \, (\forall \bar{D} \in \mathcal{N}_{\mathrm{sln}}(D)) \; L_{01}^{D}(\mathcal{S}_\ell^{D,\mathcal{F},*}) = L_{01}^{D}(\mathcal{S}_\ell^{\bar{D},\mathcal{F},*}). \qquad (3)$$

That is, SLN-robustness requires that for any level of label noise in the observed distribution $\bar{D}$, the
classification performance (with respect to $D$) of the learner is the same as if the learner directly observes $D$. Unfortunately,
as we will now see, a widely adopted class of learners is not SLN-robust.


3.2 Convex potentials with linear function classes are not SLN-robust 

Fix $\mathcal{X} = \mathbb{R}^d$, and consider learners employing a convex potential $\ell$, and a function class of linear scorers

$$\mathcal{F}_{\mathrm{lin}} = \{ x \mapsto \langle w, x \rangle \mid w \in \mathbb{R}^d \}.$$

This captures e.g. the linear SVM and logistic regression, which are widely studied in theory and applied in
practice. Unfortunately, these learners are not SLN-robust: Long and Servedio [2010, Theorem 2] give an
example where, when learning under symmetric label noise, for any convex potential $\ell$, the corrupted $\ell$-risk
minimiser over $\mathcal{F}_{\mathrm{lin}}$ has classification performance equivalent to random guessing on $D$. This implies that
$(\ell, \mathcal{F}_{\mathrm{lin}})$ is not SLN-robust¹ as per Definition 1. (All proofs may be found in Appendix A.)

Proposition 1 (Long and Servedio [2010, Theorem 2]). Let $\mathcal{X} = \mathbb{R}^d$ for any $d \geq 2$. Pick any convex potential
$\ell$. Then, $(\ell, \mathcal{F}_{\mathrm{lin}})$ is not SLN-robust.

The widespread practical use of convex potential based learners makes Proposition 1 a disheartening 
result, and motivates the search for other learners that are SLN-robust. 


3.3 The fallout: what learners are SLN-robust? 

In light of Proposition 1, there are two ways to proceed in order to obtain SLN-robust learners: either we
change the class of losses $\ell$, or we change the function class $\mathcal{F}$.

The first approach has been pursued in a large body of work that embraces non-convex losses [Stempfel
and Ralaivola, 2009, Masnadi-Shirazi et al., 2010, Ding and Vishwanathan, 2010, Denchev et al., 2012,
Manwani and Sastry, 2013]. However, while such losses avoid the conditions of Proposition 1, this does not
automatically imply that they are SLN-robust when used with $\mathcal{F}_{\mathrm{lin}}$. In Appendix B, we present evidence that
some of these losses are in fact not SLN-robust when used with $\mathcal{F}_{\mathrm{lin}}$.

The second approach is to instead consider a suitably rich $\mathcal{F}$ that contains the Bayes-optimal scorer for $D$,
e.g. by employing a universal kernel. With this choice, one can still use a convex potential loss; in fact, owing
to Equation 2, using any classification-calibrated loss will result in an SLN-robust learner when $\mathcal{F} = \mathbb{R}^{\mathcal{X}}$.

¹ Even if we weaken the notion of SLN-robustness to allow for a difference of $\epsilon \in [0, 1/2]$ between the clean and corrupted
minimisers' performance, Long and Servedio [2010, Theorem 2] implies that in the worst case $\epsilon = 1/2$.
Proposition 2. Pick any classification-calibrated $\ell$. Then, $(\ell, \mathbb{R}^{\mathcal{X}})$ is SLN-robust.

Both approaches have drawbacks. The first approach has a computational penalty, as it requires
optimising a non-convex loss. The second approach has a statistical penalty, as estimation rates with a rich $\mathcal{F}$
will require a larger sample size. Thus, it appears that SLN-robustness involves a computational-statistical
tradeoff.

However, there is a variant of the first option: pick a loss that is convex, but not a convex potential. If an
SLN-robust loss of this type exists, it affords the computational and statistical advantages of minimising
convex risks with linear scorers. Manwani and Sastry [2013] demonstrated that square loss, $\ell(y, v) = (1 - yv)^2$,
is one such loss. We will show that there is a simpler loss that is similarly convex, classification-calibrated,
and SLN-robust, but is not in the class of convex potentials by virtue of being negatively unbounded. To
derive this loss, it is helpful to interpret robustness in terms of a noise-correction procedure on loss functions.


4 SLN-robustness: a noise-corrected loss perspective 

The definition of SLN-robustness (Equation 3) involves optimal scorers with the same loss $\ell$ over two different
distributions. We now re-express this to reason about optimal scorers on the same distribution, but with two
different losses. This will help characterise the set of losses that are SLN-robust.

4.1 Reformulating SLN-robustness via noise-corrected losses 

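Given any $\rho \in [0, 1/2)$, Natarajan et al. [2013, Lemma 1] showed how to associate with a loss $\ell$ a
noise-corrected counterpart $\bar{\ell}$, such that for any $D$, $L_\ell^{D}(s) = L_{\bar{\ell}}^{\bar{D}}(s)$. The loss $\bar{\ell}$ is defined as follows.

Definition 2 (Noise-corrected loss). Given any loss $\ell$ and $\rho \in [0, 1/2)$, the noise-corrected loss $\bar{\ell}$ is

$$(\forall y \in \{\pm 1\}) \, (\forall v \in \mathbb{R}) \; \bar{\ell}(y, v) = \frac{(1 - \rho) \cdot \ell(y, v) - \rho \cdot \ell(-y, v)}{1 - 2\rho}. \qquad (4)$$

Since $\bar{\ell}$ depends on the unknown parameter $\rho$, it is not directly usable to design an SLN-robust learner.
Nonetheless, it is a useful theoretical construct, since the risk equivalence between $L_\ell^{D}(s)$ and $L_{\bar{\ell}}^{\bar{D}}(s)$ means
that for any $\mathcal{F}$, minimisation of the $\ell$-risk on $D$ over $\mathcal{F}$ is equivalent to minimisation of the $\bar{\ell}$-risk on $\bar{D}$ over
$\mathcal{F}$, i.e. $\mathcal{S}_\ell^{D,\mathcal{F},*} = \mathcal{S}_{\bar{\ell}}^{\bar{D},\mathcal{F},*}$. With this, we can re-express the SLN-robustness of a learner $(\ell, \mathcal{F})$ as

$$(\forall D \in \Delta) \, (\forall \bar{D} \in \mathcal{N}_{\mathrm{sln}}(D)) \; L_{01}^{D}(\mathcal{S}_{\bar{\ell}}^{\bar{D},\mathcal{F},*}) = L_{01}^{D}(\mathcal{S}_\ell^{\bar{D},\mathcal{F},*}). \qquad (5)$$

This reformulation is useful, because to characterise SLN-robustness of $(\ell, \mathcal{F})$, we can now consider
conditions on $\ell$ such that $\ell$ and its noise-corrected counterpart $\bar{\ell}$ induce the same restricted Bayes-optimal scorers.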
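The risk equivalence behind the noise-corrected loss can be checked pointwise: averaging the corrected loss over a label that is flipped with probability $\rho$ should return the original loss on the clean label. The following is a minimal sketch of that check; the function names `noise_corrected` and `expected_over_flips`, the choice of the hinge loss, and the value `rho = 0.3` are assumptions made for illustration.

```python
def noise_corrected(loss, rho):
    """Noise-corrected counterpart: ((1 - rho) * loss(y, v) - rho * loss(-y, v)) / (1 - 2 * rho)."""
    def corrected(y, v):
        return ((1 - rho) * loss(y, v) - rho * loss(-y, v)) / (1 - 2 * rho)
    return corrected

def hinge(y, v):
    """Hinge loss, used here only as an example of a loss to correct."""
    return max(0.0, 1.0 - y * v)

def expected_over_flips(loss, y, v, rho):
    """Expectation of loss(Ybar, v) when the clean label y is flipped with probability rho."""
    return (1 - rho) * loss(y, v) + rho * loss(-y, v)

rho = 0.3
hinge_bar = noise_corrected(hinge, rho)

# Gap between the noisy expectation of the corrected loss and the clean loss.
gaps = [
    abs(expected_over_flips(hinge_bar, y, v, rho) - hinge(y, v))
    for y in (-1, 1)
    for v in (-2.0, -0.5, 0.0, 0.7, 3.0)
]
print(max(gaps))
```

Note that the corrected hinge loss takes negative values for some scores, which is one reason it is best viewed as a theoretical device rather than a loss to minimise directly.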


4.2 Characterising a stronger notion of SLN-robustness 

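Manwani and Sastry [2013, Theorem 1] proved a sufficient condition on $\ell$ such that Equation 5 holds, namely,

$$(\exists C \in \mathbb{R}) \, (\forall v \in \mathbb{R}) \; \ell_1(v) + \ell_{-1}(v) = C. \qquad (6)$$

For such a loss, $\bar{\ell}$ is a scaled and translated version of $\ell$, so that trivially $\mathcal{S}_\ell^{\bar{D},\mathcal{F},*} = \mathcal{S}_{\bar{\ell}}^{\bar{D},\mathcal{F},*}$.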
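To make the scaling and translation explicit, one can write out the noise-corrected loss and substitute $\ell(-y, v) = C - \ell(y, v)$, which is what the constant-sum condition says. This is a short worked step using only the definitions above:

```latex
\bar{\ell}(y, v)
  = \frac{(1 - \rho)\,\ell(y, v) - \rho\,\bigl(C - \ell(y, v)\bigr)}{1 - 2\rho}
  = \frac{1}{1 - 2\rho}\,\ell(y, v) \;-\; \frac{\rho}{1 - 2\rho}\,C .
```

Since $1/(1 - 2\rho) > 0$ for every $\rho \in [0, 1/2)$, the $\bar{\ell}$-risk is an increasing affine function of the $\ell$-risk, so the two risks share the same minimisers over any function class.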

Ideally, one would like to characterise when Equation 5 holds. While this is an open question,
interestingly, we can show that under a stronger requirement on the losses $\ell$ and $\bar{\ell}$, the condition in Equation 6 is also
necessary. The stronger requirement is that the corresponding risks order all stochastic scorers identically.
A stochastic scorer is simply a mapping $f \colon \mathcal{X} \to \Delta_{\mathbb{R}}$, where $\Delta_{\mathbb{R}}$ is the set of distributions over the reals. In a
slight abuse of notation, we denote the $\ell$-stochastic risk of $f$ by

$$L_\ell^{D}(f) = \mathbb{E}_{(X,Y) \sim D} \left[ \mathbb{E}_{S \sim f(X)} [\ell(Y, S)] \right].$$

Equipped with this, we define a notion of order equivalence of loss pairs.
Definition 3 (Order equivalent loss pairs). We say that a pair of losses $(\ell, \tilde{\ell})$ are order equivalent if

$$(\forall D) \, (\forall f, g \in \Delta_{\mathbb{R}}^{\mathcal{X}}) \; L_\ell^{D}(f) \leq L_\ell^{D}(g) \iff L_{\tilde{\ell}}^{D}(f) \leq L_{\tilde{\ell}}^{D}(g).$$

Clearly, if two losses are order equivalent, their corresponding risks have the same restricted minimisers.
Consequently, if $(\ell, \bar{\ell})$ are order equivalent for every $\rho \in [0, 1/2)$, this implies that $\mathcal{S}_\ell^{\bar{D},\mathcal{F},*} = \mathcal{S}_{\bar{\ell}}^{\bar{D},\mathcal{F},*}$ for any
$\mathcal{F}$, which by Equation 5 means that for any $\mathcal{F}$, the learner $(\ell, \mathcal{F})$ is SLN-robust. We can thus think of order
equivalence of $(\ell, \bar{\ell})$ as signifying strong SLN-robustness of a loss $\ell$.

Definition 4 (Strong SLN-robustness). We say a loss $\ell$ is strongly SLN-robust if for every $\rho \in [0, 1/2)$, $(\ell, \bar{\ell})$
are order equivalent.

We establish that the sufficient condition of Equation 6 is also necessary for strong SLN-robustness of $\ell$.

Proposition 3. A loss $\ell$ is strongly SLN-robust if and only if it satisfies Equation 6.

We now return to our original goal, which was to find a convex $\ell$ that is SLN-robust for $\mathcal{F}_{\mathrm{lin}}$ (and ideally
more general function classes). The above suggests that to do so, it is reasonable to consider as admissible
those losses that satisfy Equation 6. Unfortunately, it is evident that if $\ell$ is convex, non-constant, and bounded
below by zero, then it cannot possibly be admissible in this sense. But we now show that removing the
boundedness restriction allows for the existence of a convex admissible loss.


5 The unhinged loss: a convex, classification-calibrated, strongly SLN-robust loss

Consider the following simple, but non-standard convex loss:

$$\ell^{\mathrm{unh}}_{1}(v) = 1 - v \quad \text{and} \quad \ell^{\mathrm{unh}}_{-1}(v) = 1 + v.$$

A peculiar property of the loss is that it is negatively unbounded, an issue we discuss in §5.3. Compared to
the hinge loss, the loss does not clamp at zero, i.e. it does not have a hinge. Thus, we call this the unhinged
loss². The loss has a number of attractive properties, the most immediate of which is its SLN-robustness.
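To make the comparison with the hinge loss concrete, the two losses can be written side by side. The snippet below is a small illustrative sketch; the function names `unhinged` and `hinge` and the grid of scores are choices made for this example.

```python
def unhinged(y, v):
    """Unhinged loss: 1 - y * v, with no clamp at zero."""
    return 1.0 - y * v

def hinge(y, v):
    """Hinge loss, shown for comparison: max(0, 1 - y * v)."""
    return max(0.0, 1.0 - y * v)

# Tabulate both losses for a positive label over a small grid of scores.
for v in (-2.0, -1.0, 0.0, 0.5, 1.0, 3.0):
    print(f"v={v:+.1f}  hinge={hinge(1, v):.1f}  unhinged={unhinged(1, v):+.1f}")
```

The two losses agree whenever $yv \leq 1$; beyond that point the hinge loss stays at zero while the unhinged loss keeps decreasing and becomes negative.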

5.1 The unhinged loss is strongly SLN-robust

Since $\ell^{\mathrm{unh}}_{1}(v) + \ell^{\mathrm{unh}}_{-1}(v) = 2$ for every $v$, we conclude from Proposition 3 that $\ell^{\mathrm{unh}}$ is strongly SLN-robust, and thus that
$(\ell^{\mathrm{unh}}, \mathcal{F})$ is SLN-robust for any choice of $\mathcal{F}$. Further, the following uniqueness property is not hard to show.

Proposition 4. Pick any convex loss $\ell$. Then,

$$(\exists C \in \mathbb{R}) \, (\forall v \in \mathbb{R}) \; \ell_1(v) + \ell_{-1}(v) = C \iff (\exists A, B, D \in \mathbb{R}) \, (\forall v \in \mathbb{R}) \; \ell_1(v) = -A \cdot v + B, \; \ell_{-1}(v) = A \cdot v + D.$$

That is, up to scaling and translation, $\ell^{\mathrm{unh}}$ is the only convex loss that is strongly SLN-robust.

Returning to the case of linear scorers, the above implies that $(\ell^{\mathrm{unh}}, \mathcal{F}_{\mathrm{lin}})$ is SLN-robust. This does not
contradict Proposition 1, since $\ell^{\mathrm{unh}}$ is not a convex potential as it is negatively unbounded. Intuitively, this
property allows the loss to compensate for the high penalty incurred by instances that are misclassified with
high margin by allowing for a high gain for instances that are correctly classified with high margin.

² This loss has been considered in Sriperumbudur et al. [2009], Reid and Williamson [2011] in the context of maximum mean
discrepancy; see Appendix E.4. The analysis of its SLN-robustness is to our knowledge novel.
5.2 The unhinged loss is classification-calibrated

SLN-robustness is by itself insufficient for a learner to be useful. For example, a loss that is uniformly zero
is strongly SLN-robust, but is useless as it is not classification-calibrated. Fortunately, the unhinged loss is
classification-calibrated, as we now establish. For reasons that shall be discussed subsequently, we consider
minimisation of the risk over $\mathcal{F}_B = [-B, +B]^{\mathcal{X}}$, the set of scorers with range bounded by $B \in [0, \infty)$.

Proposition 5. Fix $\ell = \ell^{\mathrm{unh}}$. Then, for any $D_{M,\eta}$ and $B \in [0, \infty)$, $\mathcal{S}_\ell^{D,\mathcal{F}_B,*} = \{ x \mapsto B \cdot \mathrm{sign}(2\eta(x) - 1) \}$.

Thus, for every $B \in [0, \infty)$, the restricted Bayes-optimal scorer over $\mathcal{F}_B$ has the same sign as the
Bayes-optimal classifier for 0-1 loss. In the limiting case where $\mathcal{F} = \mathbb{R}^{\mathcal{X}}$, the optimal scorer is attainable if we
operate over the extended reals $\mathbb{R} \cup \{\pm\infty\}$, in which case we can conclude that $\ell^{\mathrm{unh}}$ is classification-calibrated.
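The form of the restricted optimum can be read off from the conditional risk of the unhinged loss. The following worked step uses only the definitions of Section 2.1 and is offered as a sketch of the reasoning, not as a replacement for the proof in Appendix A:

```latex
L_{\ell^{\mathrm{unh}}}(\eta, v)
  = \eta\,(1 - v) + (1 - \eta)\,(1 + v)
  = 1 - (2\eta - 1)\,v .
```

This is linear in $v$, so over $v \in [-B, +B]$ it is minimised at $v = +B$ when $\eta > 1/2$ and at $v = -B$ when $\eta < 1/2$, with minimum value $1 - B \cdot |2\eta - 1|$; when $\eta = 1/2$ every $v$ attains the same value. Letting $B \to \infty$ shows why the unrestricted minimum is not attained at any finite score.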

5.3 Enforcing boundedness of the loss 

While the classification-calibration of $\ell^{\mathrm{unh}}$ is encouraging, Proposition 5 implies that its (unrestricted)
Bayes-risk is $-\infty$. Thus, the regret of every non-optimal scorer $s$ is identically $+\infty$, which hampers analysis of
consistency. In orthodox decision theory, analogous theoretical issues arise when attempting to establish
basic theorems with unbounded losses [Ferguson, 1967, pg. 78], [Schervish, 1995, pg. 172].

We can side-step this issue by restricting attention to bounded scorers, so that $\ell^{\mathrm{unh}}$ is effectively bounded.
By Proposition 5, this does not affect the classification-calibration of the loss. In the context of linear scorers,
boundedness of scorers can be achieved by regularisation: instead of working with $\mathcal{F}_{\mathrm{lin}}$, one can instead use
$\mathcal{F}_{\mathrm{lin},\lambda} = \{ x \mapsto \langle w, x \rangle \mid \|w\|_2 \leq 1/\sqrt{\lambda} \}$, where $\lambda > 0$, so that $\mathcal{F}_{\mathrm{lin},\lambda} \subseteq \mathcal{F}_{R/\sqrt{\lambda}}$ for $R = \sup_{x \in \mathcal{X}} \|x\|_2$.

Observe that restricting to bounded scorers does not affect the SLN-robustness of $\ell^{\mathrm{unh}}$, because $(\ell^{\mathrm{unh}}, \mathcal{F})$ is
SLN-robust for any $\mathcal{F}$. Thus, for example, $(\ell^{\mathrm{unh}}, \mathcal{F}_{\mathrm{lin},\lambda})$ is SLN-robust for any $\lambda > 0$. As we shall see in
§6.3, working with $\mathcal{F}_{\mathrm{lin},\lambda}$ also lets us establish SLN-robustness of the hinge loss when $\lambda$ is large.

5.4 Unhinged loss minimisation on corrupted distribution is consistent 

Using bounded scorers makes it possible to establish a surrogate regret bound for the unhinged loss. This 
shows classification consistency of unhinged loss minimisation on the corrupted distribution. 

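Proposition 6. Fix $\ell = \ell^{\mathrm{unh}}$. Then, for any $D$, $\rho \in [0, 1/2)$, $B \in [1, \infty)$, and scorer $s \in \mathcal{F}_B$,

$$\mathrm{regret}_{01}^{D}(s) \leq \mathrm{regret}_{\ell}^{D,\mathcal{F}_B}(s) = \frac{1}{1 - 2\rho} \cdot \mathrm{regret}_{\ell}^{\bar{D},\mathcal{F}_B}(s).$$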


Standard rates of convergence via generalisation bounds are also trivial to derive; see Appendix D. We 
now turn to the question of how to minimise the unhinged loss when using a kernelised scorer. 

6 Learning with the unhinged loss and kernels 

We now show that the optimal solution for the unhinged loss when employing regularisation and kernelised 
scorers has a simple form. This sheds further light on SLN-robustness and regularisation. 

6.1 The centroid classifier optimises the unhinged loss 

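Consider minimising the unhinged risk over some ball in a reproducing kernel Hilbert space $\mathcal{H}$ with kernel
$k$, i.e. consider the function class of kernelised scorers $\mathcal{F}_{\mathcal{H},\lambda} = \{ s \colon x \mapsto \langle w, \Phi(x) \rangle_{\mathcal{H}} \mid \|w\|_{\mathcal{H}} \leq 1/\sqrt{\lambda} \}$ for
some $\lambda > 0$, where $\Phi \colon \mathcal{X} \to \mathcal{H}$ is some feature mapping. Equivalently, given a distribution³ $D$, we want

$$w^*_{\mathrm{unh},\lambda} = \operatorname{argmin}_{w \in \mathcal{H}} \; \mathbb{E}_{(X,Y) \sim D} [1 - Y \cdot \langle w, \Phi(X) \rangle_{\mathcal{H}}] + \frac{\lambda}{2} \langle w, w \rangle_{\mathcal{H}}. \qquad (7)$$

³ Given a training sample $S \sim D^n$, we can use plugin estimates as appropriate.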
The first-order optimality condition implies that

$$w^*_{\mathrm{unh},\lambda} = \frac{1}{\lambda} \cdot \mathbb{E}_{(X,Y) \sim D} [Y \cdot \Phi(X)]. \qquad (8)$$

Thus, the optimal scorer for the unhinged loss is simply

$$s^*_{\mathrm{unh},\lambda} \colon x \mapsto \frac{1}{\lambda} \cdot \mathbb{E}_{(X,Y) \sim D} [Y \cdot k(X, x)] = \frac{1}{\lambda} \cdot \left( \pi \cdot \mathbb{E}_{X \sim P} [k(X, x)] - (1 - \pi) \cdot \mathbb{E}_{X \sim Q} [k(X, x)] \right). \qquad (9)$$

That is, we score an instance based on the difference of the aggregate similarity to the positive instances, and
the aggregate similarity to the negative instances. This is equivalent to a nearest centroid classifier [Manning
et al., 2008, pg. 181] [Tibshirani et al., 2002] [Shawe-Taylor and Cristianini, 2004, Section 5.1]. The quantity
$w^*_{\mathrm{unh},\lambda}$ can be interpreted as the kernel mean map of $D$; see Appendix E for more related work.

Equation 9 gives a simple way to understand the SLN-robustness of $(\ell^{\mathrm{unh}}, \mathcal{F}_{\mathcal{H},\lambda})$: it is easy to establish
(see Appendix C) that the optimal scorers on the clean and corrupted distributions only differ by a scaling,
i.e.

$$(\forall x \in \mathcal{X}) \; \mathbb{E}_{(X,Y) \sim D} [Y \cdot k(X, x)] = \frac{1}{1 - 2\rho} \cdot \mathbb{E}_{(X,\bar{Y}) \sim \bar{D}} [\bar{Y} \cdot k(X, x)]. \qquad (10)$$
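A plug-in version of the optimal unhinged scorer takes only a few lines. The sketch below is a hedged illustration rather than the authors' experimental code: the names `linear_kernel`, `unhinged_scores` and `sample`, the two-Gaussian synthetic data, and the noise rate `rho = 0.3` are all assumptions made for the example. It trains on noisy labels, evaluates against clean labels, and also checks that changing the regularisation constant leaves the predicted signs untouched.

```python
import numpy as np

def linear_kernel(A, B):
    """Gram matrix of inner products between rows of A and rows of B."""
    return A @ B.T

def unhinged_scores(X_train, y_train, X_test, kernel=linear_kernel, lam=1.0):
    """Plug-in unhinged scorer: s(x) = (1 / (lam * n)) * sum_i y_i * k(x_i, x)."""
    K = kernel(X_test, X_train)                 # shape (n_test, n_train)
    return (K @ y_train) / (lam * len(y_train))

rng = np.random.default_rng(0)

def sample(n):
    """Two unit-variance Gaussian blobs centred at (+2, +2) and (-2, -2)."""
    y = np.where(rng.random(n) < 0.5, 1, -1)
    X = rng.normal(size=(n, 2)) + 2.0 * y[:, None]
    return X, y

X_train, y_train = sample(2000)
X_test, y_test = sample(1000)

rho = 0.3                                        # symmetric label noise rate
y_noisy = np.where(rng.random(len(y_train)) < rho, -y_train, y_train)

scores = unhinged_scores(X_train, y_noisy, X_test)          # trained on noisy labels
accuracy_on_clean = np.mean(np.sign(scores) == y_test)      # evaluated on clean labels

scores_lam10 = unhinged_scores(X_train, y_noisy, X_test, lam=10.0)
same_sign = np.array_equal(np.sign(scores), np.sign(scores_lam10))
print(accuracy_on_clean, same_sign)
```

Any positive semi-definite kernel function with the same `(A, B) -> Gram matrix` signature could be passed in place of `linear_kernel`.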


6.2 Practical considerations 

We note several points relating to practical usage of the unhinged loss with kernelised scorers. First,
cross-validation is not required to select $\lambda$, since $s^*_{\mathrm{unh},\lambda}$ depends trivially on the regularisation constant: changing
$\lambda$ only changes the magnitude of scores, not their sign. Thus, regularisation simply controls the scale of the
predicted scores, and for the purposes of classification, one can simply use $\lambda = 1$.

Second, we can easily extend the scorers to use a bias regularised with strength $0 < \lambda_b \neq \lambda$. Tuning $\lambda_b$
is equivalent to computing $s^*_{\mathrm{unh},\lambda}$ as per Equation 9, and tuning a threshold on a holdout set.

Third, when $\mathcal{H} = \mathbb{R}^d$ for $d$ small, we can store $w^*_{\mathrm{unh},\lambda}$ explicitly, and use this to make predictions. For
high (or infinite) dimensional $\mathcal{H}$, we can make predictions directly via Equation 9. However, when learning
with a training sample $S \sim D^n$, this would require storing the entire sample for use at test time, which is
undesirable. To alleviate this, for a translation-invariant kernel one can use random Fourier features [Rahimi
and Recht, 2007] to find an approximate embedding of $\mathcal{H}$ into some low-dimensional $\mathbb{R}^d$, and then store
$w^*_{\mathrm{unh},\lambda}$ as usual. Alternately, one can post hoc search for a sparse approximation to $w^*_{\mathrm{unh},\lambda}$, for example
using kernel herding [Chen et al., 2012].

We now show that under some assumptions, $w^*_{\mathrm{unh},\lambda}$ coincides with the solution of two established
methods; Appendix E discusses some further relationships, e.g. to the maximum mean discrepancy.
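The random Fourier feature route mentioned above can be sketched as follows. This is an illustrative sketch under stated assumptions: it targets the Gaussian kernel $k(x, x') = \exp(-\gamma \|x - x'\|^2)$, and the helper name `make_rff`, the bandwidth `gamma = 0.5`, and the feature dimension are choices made here, not values from the experiments.

```python
import numpy as np

def make_rff(dim_in, dim_out, gamma, rng):
    """Random Fourier feature map approximating k(x, x') = exp(-gamma * ||x - x'||^2)."""
    # Frequencies are drawn from the kernel's spectral density, N(0, 2 * gamma * I).
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(dim_in, dim_out))
    b = rng.uniform(0.0, 2.0 * np.pi, size=dim_out)
    return lambda X: np.sqrt(2.0 / dim_out) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
gamma = 0.5
phi = make_rff(dim_in=2, dim_out=10_000, gamma=gamma, rng=rng)

# Compare the exact Gram matrix with the inner products of the random features.
X = rng.normal(size=(10, 2))
exact = np.exp(-gamma * ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
approx = phi(X) @ phi(X).T
max_abs_error = np.abs(exact - approx).max()

# With an explicit feature map, the plug-in unhinged solution (lam = 1) is one stored vector.
y = np.where(rng.random(10) < 0.5, 1, -1)
w_unh = (phi(X) * y[:, None]).mean(axis=0)
scores = phi(X) @ w_unh
print(max_abs_error, w_unh.shape)
```

At test time only `w_unh` and the random map need to be kept, rather than the whole training sample.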


6.3 Equivalence to a highly regularised SVM and other convex potentials 

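There is an interesting equivalence between the unhinged solution and that of a highly regularised SVM.

Proposition 7. Pick any $D$ and $\Phi \colon \mathcal{X} \to \mathcal{H}$ such that $R = \sup_{x \in \mathcal{X}} \|\Phi(x)\|_{\mathcal{H}} < \infty$. For any $\lambda > 0$, let

$$w^*_{\mathrm{hinge},\lambda} = \operatorname{argmin}_{w \in \mathcal{H}} \; \mathbb{E}_{(X,Y) \sim D} [\max(0, 1 - Y \cdot \langle w, \Phi(X) \rangle_{\mathcal{H}})] + \frac{\lambda}{2} \langle w, w \rangle_{\mathcal{H}}$$

be the soft-margin SVM solution. Then, if $\lambda \geq R^2$, $w^*_{\mathrm{hinge},\lambda} = w^*_{\mathrm{unh},\lambda}$.

Since we know that $(\ell^{\mathrm{unh}}, \mathcal{F}_{\mathcal{H},\lambda})$ is SLN-robust, it follows immediately that for $\ell^{\mathrm{hinge}} \colon (y, v) \mapsto \max(0, 1 - yv)$,
$(\ell^{\mathrm{hinge}}, \mathcal{F}_{\mathcal{H},\lambda})$ is similarly SLN-robust provided $\lambda$ is sufficiently large. That is, strong $\ell_2$ regularisation
(and a bounded feature map) endows the hinge loss with SLN-robustness⁴.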
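One way to see why a threshold of the form $\lambda \geq R^2$ appears is the following sketch, which uses only the closed form $w^*_{\mathrm{unh},\lambda} = \frac{1}{\lambda} \mathbb{E}[Y \cdot \Phi(X)]$ and the Cauchy-Schwarz inequality. It is offered as intuition and may differ from the argument given in Appendix A:

```latex
\bigl\| w^*_{\mathrm{unh},\lambda} \bigr\|_{\mathcal{H}}
  = \frac{1}{\lambda}\,\bigl\| \mathbb{E}\,[\,Y \cdot \Phi(X)\,] \bigr\|_{\mathcal{H}}
  \leq \frac{R}{\lambda},
\qquad\text{so}\qquad
\bigl| Y \cdot \langle w^*_{\mathrm{unh},\lambda}, \Phi(X) \rangle_{\mathcal{H}} \bigr|
  \leq \frac{R^2}{\lambda} \leq 1 .
```

Hence $1 - Y \cdot \langle w^*_{\mathrm{unh},\lambda}, \Phi(X) \rangle_{\mathcal{H}} \geq 0$ on every instance, so the hinge and unhinged losses coincide at $w^*_{\mathrm{unh},\lambda}$. Since the hinge loss is never smaller than the unhinged loss, the regularised hinge objective is bounded below by the regularised unhinged objective, and the two agree at the minimiser of the latter; that point therefore also minimises the regularised hinge objective, uniquely so by strong convexity.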

Proposition 7 can be generalised to show that with sufficiently strong regularisation, the limiting solution
of any twice differentiable convex potential will be the unhinged solution, i.e., the centroid classifier.
Intuitively, with strong regularisation, one only considers the behaviour of a loss near zero; but since a convex
potential $\phi$ has $\phi'(0) < 0$, it will be well-approximated by the unhinged loss near zero (which is simply
the linear approximation to $\phi$). This shows that strong $\ell_2$ regularisation endows most learners with
SLN-robustness.

⁴ By contrast, Long and Servedio [2010, Section 6] establish that $\ell_1$ regularisation does not endow SLN-robustness.


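Proposition 8. Pick any $D$, bounded feature mapping $\Phi \colon \mathcal{X} \to \mathcal{H}$, and twice differentiable convex potential
$\phi$. Let $w^*_{\phi,\lambda}$ be the minimiser of the regularised $\phi$-risk. Then,

$$(\forall \epsilon > 0) \, (\exists \lambda_0 > 0) \, (\forall \lambda > \lambda_0) \; \left\| \frac{w^*_{\phi,\lambda}}{\|w^*_{\phi,\lambda}\|_{\mathcal{H}}} - \frac{w^*_{\mathrm{unh},\lambda}}{\|w^*_{\mathrm{unh},\lambda}\|_{\mathcal{H}}} \right\|_{\mathcal{H}}^{2} \leq \epsilon.$$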


6.4 Equivalence to Fisher Linear Discriminant with whitened data 

Recall that for binary classification on $D_{M,\eta}$, the Fisher Linear Discriminant (FLD) finds a weight vector
proportional to the minimiser of square loss $\ell^{\mathrm{sq}} \colon (y, v) \mapsto (1 - yv)^2$ [Bishop, 2006, Section 4.1.5],

$$w^*_{\mathrm{sq},\lambda} = (\mathbb{E}_{X \sim M}[X X^{T}] + \lambda I)^{-1} \cdot \mathbb{E}_{(X,Y) \sim D}[Y \cdot X]. \qquad (11)$$

By Equation 10, and the fact that the corrupted marginal $\bar{M} = M$, we see that $w^*_{\mathrm{sq},\lambda}$ is only changed by a
scaling factor under label noise. This provides an alternate proof of the fact that $(\ell^{\mathrm{sq}}, \mathcal{F}_{\mathrm{lin}})$ is SLN-robust⁵
[Manwani and Sastry, 2013, Theorem 2].

Clearly, the unhinged loss solution $w^*_{\mathrm{unh},\lambda}$ is equivalent to the FLD and square loss solution $w^*_{\mathrm{sq},\lambda}$ when
the input data is whitened, i.e. $\mathbb{E}_{X \sim M}[X X^{T}] = I$. With a well-specified $\mathcal{F}$, e.g. with a universal kernel, both the
unhinged and square loss asymptotically recover the optimal classifier, but the unhinged loss does not require
a matrix inversion. With a misspecified $\mathcal{F}$, one cannot in general argue for the superiority of the unhinged loss
over square loss, or vice-versa, as there is no universally good surrogate to the 0-1 loss [Reid and Williamson,
2010, Appendix A]; Appendix F and Appendix G illustrate examples where both losses may underperform.
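The equivalence under whitening can be verified numerically with plug-in estimates. The following sketch is illustrative only: the synthetic data, the fixed mixing matrix `mix`, the value `lam = 0.1`, and the use of an eigendecomposition to whiten are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 500, 3, 0.1

# Synthetic data: correlated features whose mean is shifted by the label.
mix = np.array([[1.0, 0.5, 0.0],
                [0.0, 1.0, 0.5],
                [0.0, 0.0, 1.0]])
y = np.where(rng.random(n) < 0.5, 1.0, -1.0)
X = rng.normal(size=(n, d)) @ mix + y[:, None]

# Whiten with the plug-in second-moment matrix so that Xw.T @ Xw / n is the identity.
second_moment = X.T @ X / n
evals, evecs = np.linalg.eigh(second_moment)
Xw = X @ (evecs / np.sqrt(evals)) @ evecs.T

b = (Xw * y[:, None]).mean(axis=0)                           # plug-in E[Y * X]
w_unh = b / lam                                              # unhinged (centroid) solution
w_sq = np.linalg.solve(Xw.T @ Xw / n + lam * np.eye(d), b)   # regularised square-loss solution

cosine = w_unh @ w_sq / (np.linalg.norm(w_unh) * np.linalg.norm(w_sq))
print(cosine)
```

After whitening, the two weight vectors point in the same direction and differ only in length, so they induce the same classifier.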


7 SLN-robustness of unhinged loss: empirical illustration 


We now illustrate that the SLN-robustness of the unhinged loss is empirically manifest. We reiterate that
with high regularisation, the unhinged solution is equivalent to an SVM (and in the limit to any
classification-calibrated loss) solution. Thus, the experiments do not aim to assert that the unhinged loss is “better” than
other losses, but rather, to demonstrate that its SLN-robustness is not purely theoretical.

We first show that the unhinged risk minimiser performs well on the example of Long and Servedio
[2010]. Figure 1 shows the distribution $D$, where $\mathcal{X} = \{(1, 0), (\gamma, 5\gamma), (\gamma, -\gamma)\} \subset \mathbb{R}^2$, with marginal
distribution $M = \{\frac{1}{4}, \frac{1}{4}, \frac{1}{2}\}$ and all three instances are deterministically positive. We pick $\gamma = 1/2$. From
Figure 1, we see the unhinged minimiser perfectly classifies all three points, regardless of the level of label
noise. The hinge risk minimiser is perfect when there is no label noise, but with even a small amount of label
noise, achieves an error rate of 50%.
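The behaviour of the unhinged minimiser on this distribution can be reproduced at the population level. The sketch below is a hedged illustration: it assumes the three-point distribution with marginal weights 1/4, 1/4, 1/2 and $\gamma = 1/2$, uses the closed form $w = \frac{1}{\lambda} \mathbb{E}[\bar{Y} \cdot X]$ on the corrupted distribution, and the function name `unhinged_minimiser` is introduced here for the example.

```python
import numpy as np

gamma = 0.5
X = np.array([[1.0, 0.0], [gamma, 5 * gamma], [gamma, -gamma]])
M = np.array([0.25, 0.25, 0.5])      # marginal distribution over the three instances
y = np.ones(3)                       # every instance is deterministically positive

def unhinged_minimiser(rho, lam=1.0):
    """Population unhinged solution on SLN(D, rho): w = E[Ybar * X] / lam."""
    expected_noisy_label = (1 - 2 * rho) * y     # E[Ybar | X] when the clean label is y
    return (M * expected_noisy_label) @ X / lam

# Sign of the score on each of the three instances, for several noise rates.
signs = {}
for rho in (0.0, 0.1, 0.2, 0.3, 0.4, 0.49):
    w = unhinged_minimiser(rho)
    signs[rho] = np.sign(X @ w)
    print(rho, w, signs[rho])
```

Label noise only shrinks the weight vector by the factor $1 - 2\rho$, so the predicted sign on each instance does not change with the noise rate.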





                 Hinge          t-logistic     Unhinged
$\rho = 0$       0.00 ± 0.00    0.00 ± 0.00    0.00 ± 0.00
$\rho = 0.1$     0.15 ± 0.27    0.00 ± 0.00    0.00 ± 0.00
$\rho = 0.2$     0.21 ± 0.30    0.00 ± 0.00    0.00 ± 0.00
$\rho = 0.3$     0.38 ± 0.37    0.22 ± 0.08    0.00 ± 0.00
$\rho = 0.4$     0.42 ± 0.36    0.22 ± 0.08    0.00 ± 0.00
$\rho = 0.49$    0.47 ± 0.38    0.39 ± 0.23    0.34 ± 0.48

Table 1: Mean and standard deviation of the 0-1 error over 125 trials on Long and Servedio [2010]. Grayed
cells denote the best performer at that noise rate.

Figure 1: Long and Servedio [2010] dataset.

⁵ Square loss escapes the result of Long and Servedio [2010] since it is not monotone decreasing.
We next consider minimisers of the empirical risk from a random training sample: we construct a training
set of 800 instances, injected with varying levels of label noise, and evaluate classification performance on
a test set of 1000 instances. We compare the hinge, t-logistic (for $t = 2$) [Ding and Vishwanathan, 2010]
and unhinged minimisers. For each loss, we use a linear scorer without a bias term, and set the regularisation
strength $\lambda = 10^{-16}$. From Table 1, it is apparent that even at 40% label noise, the unhinged classifier is able
to find a perfect solution. By contrast, both other losses suffer at even moderate noise rates.

We next report results on some UCI datasets, where we additionally tune a threshold so as to ensure
the best training set 0-1 accuracy. Table 2 summarises results on a sample of four datasets. (Appendix H
contains results with more datasets, performance metrics, and losses.) While the unhinged loss is sometimes
outperformed at low noise, it tends to be much more robust at high levels of noise: even at noise close to
50%, it is often able to learn a classifier with some discriminative power.


(a) iris.

                 Hinge          t-Logistic     Unhinged
$\rho = 0$       0.00 ± 0.00    0.00 ± 0.00    0.00 ± 0.00
$\rho = 0.1$     0.01 ± 0.03    0.01 ± 0.03    0.00 ± 0.00
$\rho = 0.2$     0.06 ± 0.12    0.04 ± 0.05    0.00 ± 0.01
$\rho = 0.3$     0.17 ± 0.20    0.09 ± 0.11    0.02 ± 0.07
$\rho = 0.4$     0.35 ± 0.24    0.24 ± 0.16    0.13 ± 0.22
$\rho = 0.49$    0.60 ± 0.20    0.49 ± 0.20    0.45 ± 0.33

(b) housing.

                 Hinge          t-Logistic     Unhinged
$\rho = 0$       0.05 ± 0.00    0.05 ± 0.00    0.05 ± 0.00
$\rho = 0.1$     0.06 ± 0.01    0.07 ± 0.02    0.05 ± 0.00
$\rho = 0.2$     0.06 ± 0.01    0.08 ± 0.03    0.05 ± 0.00
$\rho = 0.3$     0.08 ± 0.04    0.11 ± 0.05    0.05 ± 0.01
$\rho = 0.4$     0.14 ± 0.10    0.24 ± 0.13    0.09 ± 0.10
$\rho = 0.49$    0.45 ± 0.26    0.49 ± 0.16    0.46 ± 0.30

(c) usps0v7.

                 Hinge          t-Logistic     Unhinged
$\rho = 0$       0.00 ± 0.00    0.00 ± 0.00    0.00 ± 0.00
$\rho = 0.1$     0.10 ± 0.08    0.11 ± 0.02    0.00 ± 0.00
$\rho = 0.2$     0.19 ± 0.11    0.15 ± 0.02    0.00 ± 0.00
$\rho = 0.3$     0.31 ± 0.13    0.22 ± 0.03    0.01 ± 0.00
$\rho = 0.4$     0.39 ± 0.13    0.33 ± 0.04    0.02 ± 0.02
$\rho = 0.49$    0.50 ± 0.16    0.48 ± 0.04    0.34 ± 0.21

(d) splice.

                 Hinge          t-Logistic     Unhinged
$\rho = 0$       0.05 ± 0.00    0.04 ± 0.00    0.19 ± 0.00
$\rho = 0.1$     0.15 ± 0.03    0.24 ± 0.00    0.19 ± 0.01
$\rho = 0.2$     0.21 ± 0.03    0.24 ± 0.00    0.19 ± 0.01
$\rho = 0.3$     0.25 ± 0.03    0.24 ± 0.00    0.19 ± 0.03
$\rho = 0.4$     0.31 ± 0.05    0.24 ± 0.00    0.22 ± 0.05
$\rho = 0.49$    0.48 ± 0.09    0.40 ± 0.24    0.45 ± 0.08

Table 2: Mean and standard deviation of the 0-1 error over 125 trials on UCI datasets.


8 Conclusion and future work 

We have proposed a convex, classification-calibrated loss, proved that it is robust to symmetric label noise
(SLN-robust), shown it is the unique convex loss (up to scaling and translation) that satisfies a notion of strong
SLN-robustness, established that it is optimised by the nearest centroid classifier, and also shown how the nature
of the optimal solution implies that most convex potentials, such as the SVM, are also SLN-robust when highly
regularised. Future work includes studying losses robust to asymmetric noise, and outliers in instance space.
Proofs for “Learning with Symmetric Label Noise: The
Importance of Being Unhinged” 

A Proofs of results in main body 

We now present proofs of all results in the main body. 

Proof of Proposition 1. This result is stated implicitly in Long and Servedio [2010, Theorem 2]; the aim of
this proof is simply to make the result explicit.

Let $\mathcal{X} = \{(1, 0), (\gamma, 5\gamma), (\gamma, -\gamma), (\gamma, -\gamma)\} \subset \mathbb{R}^2$ for some $\gamma < 1/6$. Let the marginal distribution over
$\mathcal{X}$ be uniform. Let $\eta \colon x \mapsto 1$, i.e. let every example be deterministically positive.

Now suppose we observe some $\bar{D} = \mathrm{SLN}(D, \rho)$, for $\rho \in (0, 1/2)$. We minimise the $\ell$-risk of some convex
potential $\ell \colon (y, v) \mapsto \phi(yv)$ using a linear function class⁶ $\mathcal{F}_{\mathrm{lin}}$. Then, Long and Servedio [2010, Theorem
2] establishes that

$$(\forall s \in \mathcal{S}_\ell^{\bar{D},\mathcal{F}_{\mathrm{lin}},*}) \; L_{01}^{D}(s) = \frac{1}{2}.$$

On the other hand, since $D$ is linearly separable and a convex potential $\ell$ is classification-calibrated, we must
have $L_{01}^{D}(\mathcal{S}_\ell^{D,\mathcal{F}_{\mathrm{lin}},*}) = 0$. Consequently, for any convex potential $\ell$, $(\ell, \mathcal{F}_{\mathrm{lin}})$ is not SLN-robust. □

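Proof of Proposition 2. Let $\bar{\eta}$ be the class-probability function of $\bar{D}$. By [Natarajan et al., 2013, Lemma 7],

$$(\forall x \in \mathcal{X}) \;\; \mathrm{sign}(2\bar{\eta}(x) - 1) = \mathrm{sign}(2\eta(x) - 1),$$

so that the optimal classifiers on the clean and corrupted distributions coincide. Therefore, intuitively, if the
Bayes-optimal solution for a loss recovers $\mathrm{sign}(2\bar{\eta}(x) - 1)$, it will also recover $\mathrm{sign}(2\eta(x) - 1)$. Formally,
since $\ell$ is classification-calibrated, for any $D$ and $s \in \mathcal{S}^{D, *}_{\ell}$,

$$(\forall x \in \mathcal{X}) \;\; \mathrm{sign}(s(x)) = \mathrm{sign}(2\eta(x) - 1),$$

and similarly, for any corruption $\bar{D}$ of $D$ and $s \in \mathcal{S}^{\bar{D}, *}_{\ell}$,

$$(\forall x \in \mathcal{X}) \;\; \mathrm{sign}(s(x)) = \mathrm{sign}(2\bar{\eta}(x) - 1).$$

Thus, for any $D, \bar{D}$, since the 0-1 risk of a scorer depends only on its sign,

$$L^{D}_{01}(\mathcal{S}^{D, *}_{\ell}) = L^{D}_{01}(\mathrm{sign}(2\eta - 1)) = L^{D}_{01}(\mathrm{sign}(2\bar{\eta} - 1)) = L^{D}_{01}(\mathcal{S}^{\bar{D}, *}_{\ell}).$$

Consequently, $L^{D}_{01}(\mathcal{S}^{D, *}_{\ell}) = L^{D}_{01}(\mathcal{S}^{\bar{D}, *}_{\ell})$, i.e. $(\ell, \mathbb{R}^{\mathcal{X}})$ is SLN-robust. □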

Proof of Proposition 3. ($\Leftarrow$). If $\ell$ satisfies Equation 6, then its noise-corrected counterpart is

$$(\forall y \in \{\pm 1\}) \, (\forall v \in \mathbb{R}) \;\; \bar{\ell}(y, v) = \frac{1}{1 - 2\rho} \cdot \ell(y, v) - \frac{\rho}{1 - 2\rho} \cdot C,$$

that is, it is a scaled and translated version of $\ell$. Consequently, for any $\rho$, the corresponding risk will be a
scaled and translated version of the $\ell$-risk. It is immediate that the two losses will be order equivalent for any
$\rho$.

($\Rightarrow$). Recall that $S$ denotes the distribution of scores. For any stochastic scorer $f$, let

$$S_f \colon a \mapsto \mathbb{P}(S = a)$$

be the corresponding marginal distribution of scores. Similarly, let

$$M_a \colon x \mapsto \mathbb{P}(X = x \mid S = a)$$

be the conditional distribution of instances given a predicted score $a \in \mathbb{R}$. Finally, for any $a \in \mathbb{R}$, let
$D_a = (M_a, \eta)$ be an induced distribution over $\mathcal{X} \times \{\pm 1\}$.

$^{6}$ The result actually requires that one not include a bias term; with a bias term, it can be checked that the
example as stated has a trivial solution.

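With the above, we can rewrite the stochastic risk as

$$L^{D}_{\ell}(f) = \mathbb{E}_{S \sim S_f} \, \mathbb{E}_{(X, Y) \sim D_S} \left[ \ell(Y, S) \right] = \mathbb{E}_{S \sim S_f} \left[ L^{D_S}_{\ell}(S) \right].$$

That is, we average, over all achievable scores according to $f$, the risks of that constant prediction on an
appropriately reweighted version of the original distribution $D$. Then, for some fixed $\rho \in [0, 1/2)$, the fact
that $\ell$ and $\bar{\ell}$ are order equivalent can be written

$$(\forall D) \, (\forall f, g \in \Delta_{\mathbb{R}}^{\mathcal{X}}) \;\; \mathbb{E}_{S \sim S_f} \left[ L^{D_S}_{\ell}(S) \right] \leq \mathbb{E}_{S \sim S_g} \left[ L^{D_S}_{\ell}(S) \right] \iff \mathbb{E}_{S \sim S_f} \left[ L^{D_S}_{\bar{\ell}}(S) \right] \leq \mathbb{E}_{S \sim S_g} \left[ L^{D_S}_{\bar{\ell}}(S) \right].$$

Now define the utility functions

$$U^{D} \colon a \mapsto -L^{D_a}_{\ell}(a) \quad \text{and} \quad V^{D} \colon a \mapsto -L^{D_a}_{\bar{\ell}}(a).$$

Then, order equivalence can be trivially re-expressed as

$$(\forall D) \, (\forall f, g \in \Delta_{\mathbb{R}}^{\mathcal{X}}) \;\; \mathbb{E}_{S \sim S_f} \left[ U^{D}(S) \right] \geq \mathbb{E}_{S \sim S_g} \left[ U^{D}(S) \right] \iff \mathbb{E}_{S \sim S_f} \left[ V^{D}(S) \right] \geq \mathbb{E}_{S \sim S_g} \left[ V^{D}(S) \right].$$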


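That is, for any fixed distribution $D$, the utility functions $U^{D}, V^{D}$ specify the same ordering over
distributions in $\Delta_{\mathbb{R}}$. Therefore, by DeGroot [1970, Section 7.9, Theorem 2], for any fixed $D$, they must be affinely
related:

$$(\forall D) \, (\exists \alpha, \beta \in \mathbb{R}) \, (\forall a \in \mathbb{R}) \;\; V^{D}(a) = \alpha \cdot U^{D}(a) + \beta.$$

Converting this back to losses, and using the definition of strong SLN-robustness,

$$(\forall D) \, (\forall \rho \in [0, 1/2)) \, (\exists \alpha, \beta \in \mathbb{R}) \, (\forall a \in \mathbb{R}) \;\; L^{D_a}_{\bar{\ell}}(a) = \alpha \cdot L^{D_a}_{\ell}(a) + \beta$$

or

$$(\forall D) \, (\forall \rho \in [0, 1/2)) \, (\exists \alpha, \beta \in \mathbb{R}) \, (\forall a \in \mathbb{R}) \;\; \mathbb{E}_{(X, Y) \sim D_a} \left[ \bar{\ell}(Y, a) - (\alpha \cdot \ell(Y, a) + \beta) \right] = 0.$$

For this to hold for all possible $D$, it must be true that

$$(\forall \rho \in [0, 1/2)) \, (\exists \alpha, \beta \in \mathbb{R}) \, (\forall y, v) \;\; \bar{\ell}(y, v) = \alpha \cdot \ell(y, v) + \beta.$$

By Lemma 9, the result follows. □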


Proof of Proposition 4. ($\Leftarrow$). Clearly for an $\ell$ satisfying the given condition, $\ell_1(v) + \ell_{-1}(v) = B + C$, a
constant.

($\Rightarrow$). By assumption, $\ell_1$ is convex. By the given condition, equivalently, $(\exists C \in \mathbb{R}) \; C - \ell_1$ is convex.
But this is in turn equivalent to $-\ell_1$ also being convex. The only possibility for both $\ell_1$ and $-\ell_1$ being convex
is that $\ell_1$ is affine, hence showing the desired implication. □

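Proof of Proposition 5. Fix $\ell = \ell^{\mathrm{unh}}$. It is easy to check that

$$(\forall \eta \in [0, 1]) \, (\forall v \in \mathbb{R}) \;\; L_{\ell}(\eta, v) = (1 - 2\eta) \cdot v, \qquad (12)$$

and so

$$(\forall \eta \in [0, 1]) \;\; \underset{v \in [-B, +B]}{\operatorname{argmin}} \; L_{\ell}(\eta, v) = \begin{cases} +B & \text{if } \eta > \frac{1}{2} \\ -B & \text{else.} \end{cases}$$

□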
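Proof of Proposition 6. Fix $\ell = \ell^{\mathrm{unh}}$. Since by Equation 12 $L_{\ell}(\eta, v) = (1 - 2\eta) \cdot v$, we have that

$$L^{D}_{\ell}(s) = -\mathbb{E}_{X \sim M} \left[ (2\eta(X) - 1) \cdot s(X) \right],$$

and since the restricted Bayes-optimal scorer is $x \mapsto B \cdot \mathrm{sign}(2\eta(x) - 1)$,

$$L^{D, \mathcal{F}_B, *}_{\ell} = -B \cdot \mathbb{E}_{X \sim M} \left[ |2\eta(X) - 1| \right].$$

Thus,

$$\mathrm{regret}^{D, \mathcal{F}_B}_{\ell}(s) = \mathbb{E}_{X \sim M} \left[ |2\eta(X) - 1| \cdot (B - s(X) \cdot \mathrm{sign}(2\eta(X) - 1)) \right].$$

Now, since the scorer $x \mapsto \mathrm{sign}(2\eta(x) - 1)$ is in $\mathcal{S}^{D, *}_{01} \cap \mathcal{F}_B$, we have that $\mathrm{regret}^{D, \mathcal{F}_B}_{01}(s) = \mathrm{regret}^{D}_{01}(s)$.
Further, we have that

$$\mathrm{regret}^{D}_{01}(s) = \mathbb{E}_{X \sim M} \left[ |2\eta(X) - 1| \cdot [\![ s(X) \cdot \mathrm{sign}(2\eta(X) - 1) < 0 ]\!] \right].$$

But if $B \geq 1$, then for every $v \in [-B, +B]$,

$$[\![ v < 0 ]\!] \leq B - v.$$

Thus,

$$\mathrm{regret}^{D}_{01}(s) \leq \mathrm{regret}^{D, \mathcal{F}_B}_{\ell}(s).$$

Finally, by Equation 4, for $\ell = \ell^{\mathrm{unh}}$,

$$(\forall y \in \{\pm 1\}) \, (\forall v \in \mathbb{R}) \;\; \bar{\ell}(y, v) = \frac{1}{1 - 2\rho} \cdot \ell(y, v),$$

i.e. the unhinged loss is its own noise-corrected loss, with a scaling factor of $\frac{1}{1 - 2\rho}$. Thus, since the
$\bar{\ell}$-regret on $\bar{D}$ and the $\ell$-regret on $D$ coincide,

$$\mathrm{regret}^{D, \mathcal{F}_B}_{\ell}(s) = \mathrm{regret}^{\bar{D}, \mathcal{F}_B}_{\bar{\ell}}(s) = \frac{1}{1 - 2\rho} \cdot \mathrm{regret}^{\bar{D}, \mathcal{F}_B}_{\ell}(s).$$

□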


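Proof of Proposition 7. On a distribution $D$, a soft-margin SVM solves

$$\min_{w \in \mathcal{H}} \; \mathbb{E}_{(X, Y) \sim D} \left[ \max(0, 1 - Y \cdot \langle w, \Phi(X) \rangle_{\mathcal{H}}) \right] + \frac{\lambda}{2} \langle w, w \rangle_{\mathcal{H}}.$$

Let $w^{*}_{\mathrm{hinge}, \lambda}$ denote the optimal solution to this objective. Now, by Shalev-Shwartz et al. [2007, Theorem 1],

$$\| w^{*}_{\mathrm{hinge}, \lambda} \|_{\mathcal{H}} \leq \frac{1}{\sqrt{\lambda}}.$$

Now suppose that $R = \sup_{x \in \mathcal{X}} \| \Phi(x) \|_{\mathcal{H}} < \infty$. Then, by the Cauchy-Schwarz inequality,

$$(\forall x \in \mathcal{X}) \;\; | \langle w^{*}_{\mathrm{hinge}, \lambda}, \Phi(x) \rangle_{\mathcal{H}} | \leq \| w^{*}_{\mathrm{hinge}, \lambda} \|_{\mathcal{H}} \cdot \| \Phi(x) \|_{\mathcal{H}} \leq \frac{R}{\sqrt{\lambda}}.$$

It follows that if $\lambda \geq R^2$, then

$$(\forall x \in \mathcal{X}) \;\; | \langle w^{*}_{\mathrm{hinge}, \lambda}, \Phi(x) \rangle_{\mathcal{H}} | \leq 1.$$

But this means that we never activate the flat portion of the hinge loss. Thus, for $\lambda \geq R^2$, the SVM objective
is equivalent to

$$\min_{w \in \mathcal{H}} \; \mathbb{E}_{(X, Y) \sim D} \left[ 1 - Y \cdot \langle w, \Phi(X) \rangle_{\mathcal{H}} \right] + \frac{\lambda}{2} \langle w, w \rangle_{\mathcal{H}},$$

which means the optimal solution will coincide with that of the regularised unhinged loss. Therefore, we can
view unhinged loss minimisation as corresponding to learning a highly regularised SVM$^{7}$. □

$^{7}$ This also holds if we add a regularised bias term. With an unregularised bias term, Bedo et al. [2006] showed that the limiting
solution of a soft-margin SVM is distribution dependent.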
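
The argument in the proof of Proposition 7 can be checked numerically on a small sample. The sketch below is only an illustration under stated assumptions: the four data points, the identity feature map, and the choice of regularisation strength equal to the largest squared instance norm are all hypothetical and not taken from the paper. It computes the closed-form minimiser of the regularised unhinged objective, confirms that every score lies in [-1, 1], and confirms that the hinge and unhinged objectives agree at that point. Because the hinge loss upper-bounds the unhinged loss everywhere, agreement at the unhinged minimiser implies that this point also minimises the hinge objective.

```python
# Hypothetical toy sample (an assumption for illustration only).
X = [(1.0, 2.0), (1.0, -4.0), (-1.0, 1.0), (0.5, 0.5)]
Y = [1, 1, -1, -1]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def hinge(margin):
    return max(0.0, 1.0 - margin)


def unhinged(margin):
    return 1.0 - margin


def regularised_risk(loss, w, lam):
    """Empirical risk of `loss` at w plus the penalty (lam / 2) * ||w||^2."""
    data_term = sum(loss(y * dot(w, x)) for x, y in zip(X, Y)) / len(X)
    return data_term + 0.5 * lam * dot(w, w)


# Regularisation at least R^2, where R is the largest instance norm.
lam = max(dot(x, x) for x in X)

# Setting the gradient of the unhinged objective to zero gives w = mean(y * x) / lam.
mean_yx = [sum(y * x[j] for x, y in zip(X, Y)) / len(X) for j in range(2)]
w_unh = [m / lam for m in mean_yx]

# Every score lies in [-1, 1], so the flat part of the hinge is never active.
max_abs_score = max(abs(dot(w_unh, x)) for x in X)

# The two regularised objectives therefore coincide at w_unh.
gap = abs(regularised_risk(hinge, w_unh, lam) - regularised_risk(unhinged, w_unh, lam))

# Perturbing w_unh should never decrease the hinge objective.
base = regularised_risk(hinge, w_unh, lam)
perturbed = [
    regularised_risk(hinge, [w_unh[0] + dx, w_unh[1] + dy], lam)
    for dx in (-0.1, 0.0, 0.1)
    for dy in (-0.1, 0.0, 0.1)
]
```

Here `max_abs_score` stays below 1 and `gap` is zero, which is the numerical counterpart of "never activating the flat portion of the hinge loss".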
Proof of Proposition 8. Fix some distribution $D$. Let

$$\mu = \mathbb{E}_{(X, Y) \sim D} \left[ Y \cdot \Phi(X) \right]$$

be the optimal unhinged solution with regularisation strength $\lambda = 1$. Observe that
$\| \mu \|_{\mathcal{H}} \leq R = \sup_{x \in \mathcal{X}} \| \Phi(x) \|_{\mathcal{H}} < \infty$. For some $r > 0$, let

$$w^{*}_{\phi} = \underset{\| w \|_{\mathcal{H}} \leq r}{\operatorname{argmin}} \; L_{\phi}(w)$$

be the optimal $\phi$ solution with norm bounded by $r$. Similarly, let

$$w^{*}_{\mathrm{unh}} = \frac{\| w^{*}_{\phi} \|_{\mathcal{H}}}{\| \mu \|_{\mathcal{H}}} \cdot \mu$$

be the optimal unhinged solution with the same norm as the optimal $\phi$ solution. We will show that these two
vectors have similar unhinged risks, and use this to show that the corresponding unit vectors must be close.

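By definition, a convex potential has $\phi'(0) < 0$. Without loss of generality, we can scale the potential so
that $\phi'(0) = -1$. Then, since $\phi$ is convex, it is lower bounded by the linear approximation at zero:

$$(\forall v \in \mathbb{R}) \;\; \phi(v) - \phi(0) \geq -v.$$

Observe that the RHS is the unhinged loss. Thus, the unhinged risk can be bounded by its $\phi$ counterpart. In
particular, at the optimal $\phi$ solution,

$$L_{\mathrm{unh}}(w^{*}_{\phi}) \leq L_{\phi}(w^{*}_{\phi}) - \phi(0).$$

Therefore, the difference between the unhinged risks of the $\phi$ and unhinged optimal solutions is

$$L_{\mathrm{unh}}(w^{*}_{\phi}) - L_{\mathrm{unh}}(w^{*}_{\mathrm{unh}}) \leq L_{\phi}(w^{*}_{\phi}) - L_{\mathrm{unh}}(w^{*}_{\mathrm{unh}}) - \phi(0)$$
$$\leq L_{\phi}(w^{*}_{\mathrm{unh}}) - L_{\mathrm{unh}}(w^{*}_{\mathrm{unh}}) - \phi(0)$$
$$= \mathbb{E}_{(X, Y) \sim D} \left[ \phi(Y \cdot \langle w^{*}_{\mathrm{unh}}, \Phi(X) \rangle_{\mathcal{H}}) + Y \cdot \langle w^{*}_{\mathrm{unh}}, \Phi(X) \rangle_{\mathcal{H}} \right] - \phi(0)$$
$$= \mathbb{E}_{(X, Y) \sim D} \left[ \tilde{\phi}(Y \cdot \langle w^{*}_{\mathrm{unh}}, \Phi(X) \rangle_{\mathcal{H}}) \right], \qquad (13)$$

where $\tilde{\phi} \colon v \mapsto \phi(v) - \phi(0) + v$. (The second line follows by definition of optimality of $w^{*}_{\phi}$ amongst all
vectors with norm bounded by $r$.) We have already established that $\tilde{\phi} \geq 0$. Now, by Taylor's remainder
theorem,

$$(\forall v \in [-1, 1]) \;\; \tilde{\phi}(v) \leq \frac{a}{2} \cdot v^2, \qquad (14)$$

where $a = \max_{v \in [-1, 1]} \phi''(v)$. But by Cauchy-Schwarz, we can restrict attention in Equation 13 to the
behaviour of $\tilde{\phi}$ in the interval

$$I = \left[ -\| w^{*}_{\mathrm{unh}} \|_{\mathcal{H}} \cdot R, \; \| w^{*}_{\mathrm{unh}} \|_{\mathcal{H}} \cdot R \right],$$

where $R = \sup_{x \in \mathcal{X}} \| \Phi(x) \|_{\mathcal{H}} < \infty$. Therefore, if $r \leq \frac{1}{R}$, Equation 14 and a further application of
Cauchy-Schwarz yield

$$L_{\mathrm{unh}}(w^{*}_{\phi}) - L_{\mathrm{unh}}(w^{*}_{\mathrm{unh}}) \leq \frac{a}{2} \cdot \mathbb{E}_{X \sim M} \left[ \langle w^{*}_{\mathrm{unh}}, \Phi(X) \rangle_{\mathcal{H}}^2 \right]$$
$$\leq \frac{a}{2} \cdot \mathbb{E}_{X \sim M} \left[ \| w^{*}_{\mathrm{unh}} \|_{\mathcal{H}}^2 \cdot \| \Phi(X) \|_{\mathcal{H}}^2 \right]$$
$$\leq \frac{a R^2}{2} \cdot \| w^{*}_{\mathrm{unh}} \|_{\mathcal{H}}^2.$$

Now, up to an additive constant, the unhinged risk is

$$L_{\mathrm{unh}}(w) = -\langle w, \mu \rangle_{\mathcal{H}}.$$

Thus,

$$-\langle w^{*}_{\phi}, \mu \rangle_{\mathcal{H}} + \langle w^{*}_{\mathrm{unh}}, \mu \rangle_{\mathcal{H}} \leq \frac{a R^2}{2} \cdot \| w^{*}_{\mathrm{unh}} \|_{\mathcal{H}}^2.$$

Rearranging, and by definition of $w^{*}_{\mathrm{unh}}$,

$$\langle w^{*}_{\phi}, \mu \rangle_{\mathcal{H}} \geq \langle w^{*}_{\mathrm{unh}}, \mu \rangle_{\mathcal{H}} - \frac{a R^2}{2} \cdot \| w^{*}_{\mathrm{unh}} \|_{\mathcal{H}}^2$$
$$= \| w^{*}_{\phi} \|_{\mathcal{H}} \cdot \| \mu \|_{\mathcal{H}} - \frac{a R^2}{2} \cdot \| w^{*}_{\phi} \|_{\mathcal{H}}^2$$
$$= \| w^{*}_{\phi} \|_{\mathcal{H}} \cdot \| \mu \|_{\mathcal{H}} \cdot \left( 1 - \frac{a R^2 \, \| w^{*}_{\phi} \|_{\mathcal{H}}}{2 \| \mu \|_{\mathcal{H}}} \right)$$
$$\geq \| w^{*}_{\phi} \|_{\mathcal{H}} \cdot \| \mu \|_{\mathcal{H}} \cdot \left( 1 - \frac{a R^2 \, r}{2 \| \mu \|_{\mathcal{H}}} \right),$$

where the last line is since by definition $\| w^{*}_{\phi} \|_{\mathcal{H}} \leq r$. Thus, for $\epsilon = \frac{a R^2 r}{2 \| \mu \|_{\mathcal{H}}}$,

$$\left\langle \frac{w^{*}_{\phi}}{\| w^{*}_{\phi} \|_{\mathcal{H}}}, \frac{\mu}{\| \mu \|_{\mathcal{H}}} \right\rangle_{\mathcal{H}} \geq 1 - \epsilon.$$

It follows that the two unit vectors can be made arbitrarily close to each other by decreasing $r$. Since this
corresponds to increasing the strength of regularisation (by Lagrange duality), and since $\frac{\mu}{\| \mu \|_{\mathcal{H}}}$ corresponds
to the normalised unhinged solution for any regularisation strength, the result follows.

□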


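A.1 Additional helper lemmas

Lemma 9. Pick any loss $\ell$. Suppose that

$$(\forall \rho \in [0, 1/2)) \, (\exists \alpha, \beta \in \mathbb{R}) \, (\forall y, v) \;\; \bar{\ell}(y, v) = \alpha \cdot \ell(y, v) + \beta.$$

Then,

$$(\exists C \in \mathbb{R}) \, (\forall v \in \mathbb{R}) \;\; \ell_1(v) + \ell_{-1}(v) = C.$$

Proof of Lemma 9. By the definition of the noise-corrected loss (Equation 4), the given statement is that there
exist $\alpha, \beta \colon [0, 1/2) \to \mathbb{R}$ with

$$(\forall \rho \in [0, 1/2)) \, (\forall v \in \mathbb{R}) \;\; \begin{bmatrix} 1 - \rho & \rho \\ \rho & 1 - \rho \end{bmatrix}^{-1} \begin{bmatrix} \ell_1(v) \\ \ell_{-1}(v) \end{bmatrix} = \alpha(\rho) \cdot \begin{bmatrix} \ell_1(v) \\ \ell_{-1}(v) \end{bmatrix} + \begin{bmatrix} \beta(\rho) \\ \beta(\rho) \end{bmatrix}.$$

Expanding out the matrix inverse,

$$(\forall \rho \in [0, 1/2)) \, (\forall v \in \mathbb{R}) \;\; \frac{1}{1 - 2\rho} \begin{bmatrix} 1 - \rho & -\rho \\ -\rho & 1 - \rho \end{bmatrix} \begin{bmatrix} \ell_1(v) \\ \ell_{-1}(v) \end{bmatrix} = \alpha(\rho) \cdot \begin{bmatrix} \ell_1(v) \\ \ell_{-1}(v) \end{bmatrix} + \begin{bmatrix} \beta(\rho) \\ \beta(\rho) \end{bmatrix}.$$

Adding together the two sets of equations,

$$(\forall \rho \in [0, 1/2)) \, (\forall v \in \mathbb{R}) \;\; \ell_1(v) + \ell_{-1}(v) = \alpha(\rho) \cdot (\ell_1(v) + \ell_{-1}(v)) + 2\beta(\rho),$$

or

$$(\forall \rho \in [0, 1/2)) \, (\forall v \in \mathbb{R}) \;\; (1 - \alpha(\rho)) \cdot (\ell_1(v) + \ell_{-1}(v)) = 2\beta(\rho).$$

Since the RHS is independent of $v$, the LHS cannot depend on $v$, i.e. $\ell_1(v) + \ell_{-1}(v)$ must be a constant. □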
Additional Discussion for “Learning with Symmetric
Label Noise: The Importance of Being Unhinged”

B Evidence that non-convex losses and linear scorers may not be SLN-robust

We now present evidence that for $\ell$ being the TangentBoost loss,

$$\ell(y, v) = (2 \tan^{-1}(yv) - 1)^2,$$

or the $t$-logistic regression loss for $t = 2$,

$$\ell(y, v) = \log(1 - yv + \sqrt{1 + v^2}),$$

$(\ell, \mathcal{F}_{\mathrm{lin}})$ is not SLN-robust. We do this by looking at the minimisers of these losses on the 2D example
of Long and Servedio [2010]. Of course, as these losses are non-convex, exact minimisation of the risk is
challenging. However, as the search space is $\mathbb{R}^2$, we construct a grid of resolution 0.025 over $[-10, 10]^2$. We
then exhaustively compute the objective for all grid points, and seek the minimiser.
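
The grid-search procedure can be sketched as follows. This is a minimal illustration, not the authors' code: the value `GAMMA = 0.1`, the coarse grid step of 0.5 (chosen so the pure-Python loop runs quickly, rather than the 0.025 used in the text), and the convention that a zero score counts as an error are all assumptions. The noisy risk is computed exactly by averaging each loss over the two possible observed labels, using the fact that every clean label on this dataset is positive.

```python
import math


def tangent_boost(y, v):
    """TangentBoost loss (2 * arctan(y * v) - 1)^2."""
    return (2.0 * math.atan(y * v) - 1.0) ** 2


def t_logistic_2(y, v):
    """t-logistic regression loss for t = 2."""
    return math.log(1.0 - y * v + math.sqrt(1.0 + v * v))


def long_servedio_points(gamma):
    """The four instances of the Long and Servedio [2010] example (all labelled +1)."""
    return [(1.0, 0.0), (gamma, 5.0 * gamma), (gamma, -gamma), (gamma, -gamma)]


def noisy_risk(loss, w, points, rho):
    """Exact risk under symmetric label noise: label +1 w.p. 1 - rho, -1 w.p. rho."""
    total = 0.0
    for x in points:
        v = w[0] * x[0] + w[1] * x[1]
        total += (1.0 - rho) * loss(1, v) + rho * loss(-1, v)
    return total / len(points)


def grid_minimiser(loss, points, rho, lo=-10.0, hi=10.0, step=0.5):
    """Exhaustively evaluate the noisy risk on a square grid and return the best point."""
    n = int(round((hi - lo) / step))
    best_w, best_risk = None, float("inf")
    for i in range(n + 1):
        for j in range(n + 1):
            w = (lo + i * step, lo + j * step)
            risk = noisy_risk(loss, w, points, rho)
            if risk < best_risk:
                best_w, best_risk = w, risk
    return best_w, best_risk


def zero_one_error(w, points):
    """Fraction of (clean, all-positive) instances that receive a non-positive score."""
    wrong = sum(1 for x in points if w[0] * x[0] + w[1] * x[1] <= 0.0)
    return wrong / len(points)


GAMMA = 0.1  # hypothetical value; any gamma < 1/6 fits the construction
points = long_servedio_points(GAMMA)
best_w, best_risk = grid_minimiser(tangent_boost, points, rho=0.3)
clean_error = zero_one_error(best_w, points)
```

Passing `t_logistic_2` instead of `tangent_boost` gives the corresponding search for the t-logistic loss; a finer `step` trades running time for resolution.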

We apply this procedure to the Long and Servedio [2010] dataset with $\gamma =$ … and with a 30% noise
rate. Figure 2 plots the results of the objective for the TangentBoost loss. We find that the minimiser is at
$w^{*} = (0.2, 1.3)$. This results in a classifier with error rate of … on $D$. Similarly, from Figure 3, we find that
the minimiser for the $t$-logistic loss is $w^{*} = (1.025, 5.1)$, which also results in a classifier with error rate of … on $D$.
Figure 2: Risk values for various weight vectors $w = (w_1, w_2)$, TangentBoost, Long and Servedio [2010]
dataset.

The shape of these plots suggests that the minimiser is indeed found in the interval $[-10, 10]^2$. To further
verify this, we performed L-BFGS minimisation of these losses using 100 different random initialisations,
drawn uniformly from $[-100, 100]^2$. We find that in each trial, the TangentBoost solution converges to $w^{*} =
(0.2122, 1.3031)$, while the $t$-logistic solution converges to $w^{*} = (1.0372, 5.0873)$, both of which result in
accuracy of … on $D$.

B.l In defence of non-convex losses: beyond SLN-robustness 

The above illustrates the possible non SLN-robustness of two non-convex losses. However, there may be
other notions under which these losses are robust. For example, Ding and Vishwanathan [2010] define
robustness to be a stability of the asymptotic maximum likelihood solution when adding a new labelled
instance (chosen arbitrarily from $\mathcal{X} \times \{\pm 1\}$), based on a definition in O’Hagan [1979]. Intuitively, this
captures robustness to outliers in the instance space, so that e.g. an adversarial mislabelling of an instance far
from the true decision boundary does not adversely affect the learned model. Such a notion of robustness is
clearly of practical interest, and future study of such alternate notions would be of value.

Figure 3: Risk values for various weight vectors $w = (w_1, w_2)$, $t$-logistic regression, Long and Servedio
[2010] dataset.

B.2 Conjecture: (most) strictly proper composite losses are not SLN-robust 

More abstractly, we conjecture the above can be generalised to the following. Recall that a loss $\ell$ is strictly
proper composite [Reid and Williamson, 2010] if its (unique) Bayes-optimal scorer is some strictly monotone
transformation $\psi$ of the class-probability function: $(\forall D) \; \mathcal{S}^{D, *}_{\ell} = \{ \psi \circ \eta \}$.

Conjecture 1. Pick any strictly proper composite (but not necessarily convex) $\ell$ whose link function has
range $\mathbb{R}$. Then, $(\ell, \mathcal{F}_{\mathrm{lin}})$ is not SLN-robust.

We believe the above is true for the following reason. Suppose $D$ is some linearly separable distribution,
with $\eta \colon x \mapsto [\![ \langle w^{*}, x \rangle > 0 ]\!]$ for some $w^{*}$. Then, minimising $\ell$ with $\mathcal{F}_{\mathrm{lin}}$ will be well-specified: the
Bayes-optimal scorer is $\psi([\![ \langle w^{*}, x \rangle > 0 ]\!])$. If the range of $\psi$ is $\mathbb{R}$, then this is equivalent to
$\infty \cdot (2 [\![ \langle w^{*}, x \rangle > 0 ]\!] - 1)$, which is in $\mathcal{F}_{\mathrm{lin}}$ if we allow for the extended reals. The resulting classifier will thus
have 100% accuracy.

However, by injecting any non-zero label noise, minimising $\ell$ with $\mathcal{F}_{\mathrm{lin}}$ will no longer be well-specified,
as $\bar{\eta}$ takes on the values $\{1 - \rho, \rho\}$, which cannot be the sole set of output scores for any linear scorer if
$|\mathcal{X}| > 3$. We believe it unlikely that every such misspecified solution has 100% accuracy on $D$. We further
believe it likely that one can exhibit a scenario, possibly the same as the Long and Servedio [2010] example,
where the resulting solution has accuracy 50%.

Two further comments are in order. First, if a loss is strictly proper composite, then it cannot satisfy
Equation 6, and hence it cannot be strongly SLN-robust. (However, this does leave open the possibility that
with $\mathcal{F}_{\mathrm{lin}}$, the loss is SLN-robust.) Second, observe that the restriction that $\psi$ have range $\mathbb{R}$ is necessary to
rule out cases such as square loss, where the link function has range $[-1, 1]$.
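C Preservation of mean maps

Pick any $D$, and $\rho \in [0, 1/2)$. Then,

$$(\forall x \in \mathcal{X}) \;\; 2\bar{\eta}(x) - 1 = 2 \cdot ((1 - 2\rho) \cdot \eta(x) + \rho) - 1 = (1 - 2\rho) \cdot (2\eta(x) - 1).$$

Thus, for any feature mapping $\Phi \colon \mathcal{X} \to \mathcal{H}$, the kernel mean map of the clean distribution is

$$\mathbb{E}_{(X, Y) \sim D} \left[ Y \cdot \Phi(X) \right] = \mathbb{E}_{X \sim M} \left[ (2\eta(X) - 1) \cdot \Phi(X) \right] = \frac{1}{1 - 2\rho} \cdot \mathbb{E}_{X \sim M} \left[ (2\bar{\eta}(X) - 1) \cdot \Phi(X) \right] = \frac{1}{1 - 2\rho} \cdot \mathbb{E}_{(X, Y) \sim \bar{D}} \left[ Y \cdot \Phi(X) \right],$$

which is a scaled version of the kernel mean map of the noisy distribution. That is, the kernel mean map is
preserved, up to scaling, under symmetric label noise. Instantiating the above with a specific instance $x \in \mathcal{X}$ gives
Equation 10.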
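
The scaling identity above is easy to confirm numerically. The following sketch uses a hypothetical three-point distribution with an identity feature map (both are assumptions made purely for illustration) and computes the mean map exactly on the clean and on the corrupted distribution.

```python
# Hypothetical discrete distribution: (instance x, marginal probability M(x), eta(x)).
points = [
    ((1.0, 2.0), 0.5, 0.9),
    ((1.0, -4.0), 0.3, 0.2),
    ((-1.0, 1.0), 0.2, 0.6),
]


def corrupt(eta, rho):
    """Class-probability after symmetric label noise of rate rho."""
    return (1.0 - 2.0 * rho) * eta + rho


def mean_map(points, rho=0.0):
    """E[Y * Phi(X)] with Phi the identity, under noise rate rho (rho = 0 is the clean case)."""
    dim = len(points[0][0])
    out = [0.0] * dim
    for x, m, eta in points:
        coeff = m * (2.0 * corrupt(eta, rho) - 1.0)
        for j in range(dim):
            out[j] += coeff * x[j]
    return out


rho = 0.3
clean = mean_map(points)
noisy = mean_map(points, rho)
# Dividing the noisy mean map by (1 - 2 * rho) should recover the clean one.
rescaled = [c / (1.0 - 2.0 * rho) for c in noisy]
```

Since only the direction of the mean map matters for classification with the unhinged loss, the unknown factor `1 - 2 * rho` does not need to be estimated.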

D Additional theoretical considerations 
D.l Generalisation bounds 

Generalisation bounds are readily derived for the unhinged loss. For a training sample $S \sim D^n$, define the
$\ell$-deviation of a scorer $s \colon \mathcal{X} \to \mathbb{R}$ to be the difference in its population and empirical $\ell$-risk,

$$\mathrm{dev}^{D, S}_{\ell}(s) = L^{D}_{\ell}(s) - L^{S}_{\ell}(s).$$

This quantity is of interest because a standard result says that for the empirical risk minimiser $s_n$ over some
function class $\mathcal{F}$, $\mathrm{regret}^{D, \mathcal{F}}_{\ell}(s_n) \leq 2 \cdot \sup_{s \in \mathcal{F}} |\mathrm{dev}^{D, S}_{\ell}(s)|$ [Boucheron et al., 2005, Equation 2]. For unhinged
loss, we have the following Rademacher based bound.

Proposition 10. Pick any $D$ and $n \in \mathbb{N}_{+}$. Let $S \sim D^n$ denote an empirical sample. For some $B \in \mathbb{R}_{+}$, let
$s \in \mathcal{F}_B$. Then, with probability at least $1 - \delta$ over the choice of $S$, for $\ell = \ell^{\mathrm{unh}}$,


devf’^(s) < 2-3i„(JB,S) + S- 
where $\mathfrak{R}_n(\mathcal{F}_B, S)$ is the empirical Rademacher complexity of $\mathcal{F}_B$ on sample $S$.

Proof of Proposition 10. The standard Rademacher-complexity generalisation bound [Bartlett and Mendelson,
2002, Theorem 7], [Boucheron et al., 2005, Theorem 4.1] states that with probability at least $1 - \delta$ over
the choice of $S$,

devf ^(s) < 2 • ||(f)'|U • + ||f|| 

For the unhinged loss, $\| (\phi^{\mathrm{unh}})' \|_{\infty} = 1$. Further, since we work over bounded scorers, $\| \phi^{\mathrm{unh}} \|_{\infty} = B$. The
result follows. □

Proposition 10 holds equally when learning from a corrupted sample $\bar{S} \sim \bar{D}^n$. Since
$\mathrm{regret}^{D, \mathcal{F}_B}_{\ell}(s_n) = \frac{1}{1 - 2\rho} \cdot \mathrm{regret}^{\bar{D}, \mathcal{F}_B}_{\ell}(s_n)$ by Proposition 6, by minimising the unhinged loss on the corrupted sample, we can
bound the regret on the clean distribution.

E Additional relations to existing methods 

We discuss some further connections of the unhinged loss to existing methods. 
E.1 Unhinging the SVM

We can motivate the unhinged loss intuitively by studying the noise-corrected versions of the hinge loss, as
per Equation 4. Figure 4 shows the noise-corrected hinge loss for $\rho \in \{0, 0.2, 0.4\}$. We see that as the noise
rate increases, the effect is to slightly unhinge the original loss, by removing its flat portion$^{8}$. Thus, if we
knew the noise rate $\rho$, we could use these slightly unhinged losses to learn.

Figure 4: Noise-corrected versions of hinge loss, $\ell_1(v) = \max(0, 1 - v)$. Best viewed in colour.

Of course, in general we do not know the noise rate. Further, the slightly unhinged losses are non-convex.
So, in order to be robust to an arbitrary noise rate $\rho$, we can completely unhinge the loss, yielding

$$\ell^{\mathrm{unh}}_{1}(v) = 1 - v \quad \text{and} \quad \ell^{\mathrm{unh}}_{-1}(v) = 1 + v.$$
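
The noise-corrected losses discussed above can be computed directly. The sketch below assumes that Equation 4 is the correction obtained by inverting the 2×2 noise matrix, as in the proof of Lemma 9, i.e. the corrected loss is ((1 - ρ)·ℓ(y, v) - ρ·ℓ(-y, v)) / (1 - 2ρ); the function names are hypothetical. It evaluates the corrected hinge loss at a few scores to exhibit the two properties mentioned in the text: the corrected loss takes negative values (it is negatively unbounded as the score grows) and it violates midpoint convexity.

```python
def hinge(y, v):
    return max(0.0, 1.0 - y * v)


def unhinged(y, v):
    return 1.0 - y * v


def noise_corrected(loss, rho):
    """Return the noise-corrected version of `loss` for a noise rate rho in [0, 1/2)."""
    if not 0.0 <= rho < 0.5:
        raise ValueError("rho must lie in [0, 1/2)")

    def corrected(y, v):
        return ((1.0 - rho) * loss(y, v) - rho * loss(-y, v)) / (1.0 - 2.0 * rho)

    return corrected


corrected_hinge = noise_corrected(hinge, 0.2)

# On a confidently correct score the corrected hinge is negative rather than flat at zero.
value_at_2 = corrected_hinge(1, 2.0)   # analytically -1
value_at_5 = corrected_hinge(1, 5.0)   # analytically -2, and decreasing without bound

# Midpoint convexity fails around v = -1: f(-1) exceeds the average of f(-2) and f(0).
midpoint_value = corrected_hinge(1, -1.0)
chord_value = 0.5 * (corrected_hinge(1, -2.0) + corrected_hinge(1, 0.0))

# For the unhinged loss the correction only rescales the score term.
corrected_unhinged = noise_corrected(unhinged, 0.2)
unhinged_value = corrected_unhinged(1, 0.3)  # analytically 1 - 0.3 / 0.6 = 0.5
```

With `rho = 0` the function returns the original loss unchanged, which is a quick sanity check on the formula.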

E.2 Relation to centroid classifiers 

As established in §6.1, the optimal unhinged classifier (Equation 9) is equivalent to a centroid classifier,
where one replaces the positive and negative classes by their centroids, and performs classification based
on the distance of an instance to the two centroids. Such a classifier has been proposed as a prototypical
example of a simple kernel-based classifier [Scholkopf and Smola, 2002, Section 1.2], [Shawe-Taylor and
Cristianini, 2004, Section 5.1]. Balcan et al. [2008, Definition 4] consider such classification rules using
general similarity functions in place of kernels corresponding to an RKHS.

The optimal unhinged classifier is also closely related to the Rocchio classifier in information retrieval
[Manning et al., 2008, pg. 181], and the nearest centroid classifier in computational genomics [Tibshirani
et al., 2002]. The optimal kernelised scorer for these approaches is [Doloc-Mihu et al., 2003]

$$s \colon x \mapsto \mathbb{E}_{X \sim P} \left[ k(X, x) \right] - \mathbb{E}_{X \sim Q} \left[ k(X, x) \right],$$

i.e. it does not weight each of the kernel means.

E.3 Relation to kernel density estimation 

When working with an RKHS with a translation invariant kernel$^{9}$, the optimal unhinged scorer (Equation 9)
can be interpreted as follows: perform kernel density estimation on the positive and negative classes, and
then classify instances according to Bayes’ rule. For example, with a Gaussian RBF kernel, the classifier is
equivalent to using a Gaussian kernel to compute density estimates of $P, Q$, and using these to classify. This
is known as a kernel classification rule [Devroye et al., 1996, Chapter 10].

$^{8}$ Another interesting observation is that these noise-corrected losses are negatively unbounded; that is, minimising hinge loss on $D$
is equivalent to minimising a negatively unbounded loss on $\bar{D}$. This is another justification for studying negatively unbounded losses.

$^{9}$ For a general (not necessarily translation invariant) kernel, this is known as a potential function rule [Devroye et al., 1996, §10.3].
The use of “potential” here is distinct from that of a “convex potential”.
This perspective suggests that in computing we may also estimate the corrupted class-probability

E [fc(X,a:)] 

function. In particular, observe that if we compute ’ similar to the Nadaraya-Watson estimator 

X~Q ' ’ ' 

[Bishop, 2006, pg. 300], then this provides an estimate of Of course, such an approach will succumb 

to the curse of dimensionality 

An alternative is to use the Probing reduction [Langford and Zadrozny, 2005], by computing an ensemble 
of cost-sensitive classifiers at varying cost ratios. To this end, observe that the following weighted unhinged 
(or whinge) loss, 

^whi„ge(^) ^ 

^whinge(^) ^ ^ 

for some $c_{-1} \in [0, 1]$ and $c_1 = 1 - c_{-1}$, will have a restricted Bayes-optimal scorer of $B \cdot \mathrm{sign}(\eta(x) - c_{-1})$
over $\mathcal{F}_B$. Further, it will result in an optimal scorer that simply weights each of the kernel means,

$$s^{*}_{\mathrm{whinge}, \lambda} \colon x \mapsto \frac{1}{\lambda} \cdot \mathbb{E}_{(X, Y) \sim D} \left[ c_Y \cdot Y \cdot k(X, x) \right],$$

making it trivial to compute as $c$ is varied.

E.4 Relation to the MMD witness 

The optimal weight vector for unhinged loss (Equation 8) can be expressed as

$$w^{*}_{\mathrm{unh}, \lambda} = \frac{1}{\lambda} \cdot (\pi \cdot \mu_P - (1 - \pi) \cdot \mu_Q),$$

where $\mu_P$ and $\mu_Q$ are the kernel mean maps with respect to $\mathcal{H}$ of the positive and negative class-conditional
distributions,

$$\mu_P = \mathbb{E}_{X \sim P} \left[ \Phi(X) \right]$$
$$\mu_Q = \mathbb{E}_{X \sim Q} \left[ \Phi(X) \right].$$

When $\pi = \frac{1}{2}$, $\| w^{*}_{\mathrm{unh}, \frac{1}{2}} \|_{\mathcal{H}}$ is precisely the maximum mean discrepancy (MMD) [Gretton et al., 2012] between
$P$ and $Q$, using all functions in the unit ball of $\mathcal{H}$. The mapping $x \mapsto \langle w^{*}_{\mathrm{unh}, \frac{1}{2}}, \Phi(x) \rangle_{\mathcal{H}}$ itself is referred to as the
witness function [Gretton et al., 2012, §2.3]. While the motivation of MMD is to perform hypothesis testing
so as to distinguish between two distributions $P, Q$, rather than constructing a suitable scorer, the fact that it
arises from the optimal scorer for the unhinged loss has been previously noted [Sriperumbudur et al., 2009,
Theorem 1].


F Example of poor classification with square loss 


We illustrate that square loss with a linear function class may perform poorly even when the underlying
distribution is linearly separable. We consider the dataset of Long and Servedio [2010], with no label noise.
That is, we have $\mathcal{X} = \{(1, 0), (\gamma, 5\gamma), (\gamma, -\gamma), (\gamma, -\gamma)\} \subset \mathbb{R}^2$ and $\eta \colon x \mapsto 1$. Let $X \in \mathbb{R}^{4 \times 2}$ be the feature
matrix of the four data points. Then, the optimal weight vector learned by square loss is

$$w^{*} = (X^{T} X)^{-1} X^{T} \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} \frac{8\gamma + 3}{8\gamma^2 + 3} \\ \frac{-\gamma + 1}{3\gamma \cdot (8\gamma^2 + 3)} \end{bmatrix}.$$

$^{10}$ This refers to the rate of convergence of the estimate of $\eta$ to the true $\eta$. By contrast, generalisation bounds establish that the rate
of convergence of the estimate of the corresponding classifier to the Bayes-optimal classifier $\mathrm{sign}(2\eta(x) - 1)$ is independent of the
dimension of the feature space.

It is easy to check that the predicted scores are then

$$s^{*} = \begin{bmatrix} \frac{8\gamma + 3}{8\gamma^2 + 3} \\ \frac{\gamma \cdot (8\gamma + 3)}{8\gamma^2 + 3} - \frac{5 (\gamma - 1) \gamma}{24\gamma^3 + 9\gamma} \\ \frac{(\gamma - 1) \cdot \gamma}{24\gamma^3 + 9\gamma} + \frac{\gamma \cdot (8\gamma + 3)}{8\gamma^2 + 3} \\ \frac{(\gamma - 1) \cdot \gamma}{24\gamma^3 + 9\gamma} + \frac{\gamma \cdot (8\gamma + 3)}{8\gamma^2 + 3} \end{bmatrix}.$$

But for $\gamma < \frac{1}{12}$, this means that the predicted scores for the last two examples are negative. That is, the resulting
classifier will have 50% accuracy. (This does not contradict the robustness of square loss, as robustness
simply requires that performance is the same with and without noise.)
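
The closed form for the square-loss solution can be cross-checked by solving the 2×2 normal equations directly. This is a minimal sketch under stated assumptions: `gamma = 0.05` is a hypothetical value chosen to be small, the instances are taken to be (1, 0), (γ, 5γ) and two copies of (γ, -γ), and the helper name `least_squares_2d` is illustrative. Combining the two fractions in the score of the repeated instance shows that its sign is that of 24γ² + 10γ - 1, which is negative for this choice of γ.

```python
def least_squares_2d(points, targets):
    """Solve the normal equations (X^T X) w = X^T y for 2-D inputs by Cramer's rule."""
    a = sum(x[0] * x[0] for x in points)
    b = sum(x[0] * x[1] for x in points)
    d = sum(x[1] * x[1] for x in points)
    r0 = sum(x[0] * t for x, t in zip(points, targets))
    r1 = sum(x[1] * t for x, t in zip(points, targets))
    det = a * d - b * b
    return ((d * r0 - b * r1) / det, (a * r1 - b * r0) / det)


gamma = 0.05  # hypothetical value, small enough for the failure to appear
points = [(1.0, 0.0), (gamma, 5.0 * gamma), (gamma, -gamma), (gamma, -gamma)]
targets = [1.0, 1.0, 1.0, 1.0]  # every instance is deterministically positive

w = least_squares_2d(points, targets)

# Closed form stated in the text.
closed_form = (
    (8.0 * gamma + 3.0) / (8.0 * gamma ** 2 + 3.0),
    (1.0 - gamma) / (3.0 * gamma * (8.0 * gamma ** 2 + 3.0)),
)

scores = [w[0] * x[0] + w[1] * x[1] for x in points]
accuracy = sum(1 for s in scores if s > 0.0) / len(scores)
```

For this value of `gamma`, the last two scores are negative, so `accuracy` is 0.5 even though the data are linearly separable and noise-free.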

It is initially surprising that square loss fails in this example, as we are employing a linear function class,
and the true $\eta$ is expressible as a linear function. However, recall that the Bayes-optimal scorer for square
loss is

$$\mathcal{S}^{D, *}_{\ell} = \{ s \colon x \mapsto 2\eta(x) - 1 \}.$$

In this case, the Bayes-optimal scorer is

$$s^{*} \colon x \mapsto 2 [\![ x_1 > 0 ]\!] - 1.$$

The application of a threshold means that the scorer is not expressible as a linear model. Therefore, the
combination of loss and function class is in fact not well-specified for the problem.

To clarify this point, consider the use of the squared hinge loss, $\ell(y, v) = \max(0, 1 - yv)^2$. This loss
induces a set of Bayes-optimal scorers, which are:

$$\mathcal{S}^{D, *}_{\ell} = \left\{ s \;\middle|\; (\forall x \in \mathcal{X}) \; \begin{cases} \eta(x) = 1 \implies s(x) \in [1, \infty) \\ \eta(x) \in (0, 1) \implies s(x) = 2\eta(x) - 1 \\ \eta(x) = 0 \implies s(x) \in (-\infty, -1] \end{cases} \right\}.$$

Crucially, we can find a linear scorer that is in this set: for, say, $w = (\frac{1}{\gamma}, 0)$, we clearly have $\langle w, x \rangle \geq 1$ for
every $x \in \mathcal{X}$, and so this is a Bayes-optimal scorer. Thus, minimising the squared hinge loss on this distribution
will indeed find a classifier with 100% accuracy.

G Example of poor classification with unhinged loss 

We illustrate that the unhinged loss with a linear function class may perform poorly even when the underlying
distribution is linearly separable. (For another example where instances are on the unit ball, see Balcan
et al. [2008, Figure 1].) Consider a distribution $D_{M, \eta}$ uniformly concentrated on $\mathcal{X} = \{x_1, x_2, x_3\}$ with
$x_1 = (1, 2)$, $x_2 = (1, -4)$, $x_3 = (-1, 1)$, with $\eta(x_1) = \eta(x_2) = 1$ and $\eta(x_3) = 0$, i.e. the first two instances
are positive, and the third instance negative. Then it is evident that the optimal unhinged hyperplane, with
regularisation strength 1, is $w^{*} = (1, -1)$. This will misclassify the first instance as being negative. Figure 5
illustrates.
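
The example can be reproduced in a few lines. The sketch below assumes that the optimal unhinged weight vector with regularisation strength λ is the label-weighted mean of the instances divided by λ, consistent with the mean-map expression used in the proof of Proposition 8; the function name is hypothetical.

```python
points = [(1.0, 2.0), (1.0, -4.0), (-1.0, 1.0)]
labels = [1, 1, -1]


def unhinged_weight(points, labels, lam=1.0):
    """Label-weighted mean of the instances, scaled by 1 / lam."""
    n = len(points)
    dim = len(points[0])
    return tuple(
        sum(y * x[j] for x, y in zip(points, labels)) / (n * lam) for j in range(dim)
    )


w = unhinged_weight(points, labels)
scores = [sum(wj * xj for wj, xj in zip(w, x)) for x in points]
predictions = [1 if s > 0.0 else -1 for s in scores]
```

Running this gives `w = (1.0, -1.0)` and scores of -1, 5 and -2, so the first (positive) instance is predicted negative while the other two are classified correctly.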

It is easy to check that for this particular distribution, the optimal weight for square loss is $w^{*} = (1, 0)$.
This results in perfect classification. Thus, we have a reversal of the scenario of the previous section: here,
square loss classifies perfectly, while the unhinged loss classifies no better than random guessing.

It may appear that the above contradicts the classification-calibration of the unhinged loss: there certainly
is a linear scorer that is Bayes-optimal over $\mathcal{F}_B$, namely, $w^{*} = (B, 0)$. The subtlety is that in this case,
minimisation over the unit ball $\| w \|_2 \leq 1$ (as implied by $\ell_2$ regularisation) is unable to restrict attention to
the desired scorer.

There are two ways to rectify examples such as the above. First, as in general, we can employ a suitably
rich kernel, e.g. a Gaussian RBF kernel. It is not hard to verify that on this dataset, such a kernel will find
a perfect classifier. Second, we can look to explicitly enforce that minimisation is over all $w$ satisfying
$| \langle w, x_n \rangle | \leq 1$. This will result in a linear program (LP) that may be solved easily, but does not admit a closed
form solution as in the case of minimising over the unit ball. It may be checked that the resulting LP will
recover the optimal weight $w^{*} = (1, 0)$. While this approach is suitable for this particular example, issues
arise when dealing with infinite dimensional feature mappings (as we lose the existence of a representer
theorem without regularisation based on the norm in the Hilbert space [Yu et al., 2013]).

Figure 5: Example of linearly separable distribution where, when learning with the unhinged loss and a linear
function class, the resulting hyperplane (in red) misclassifies one of the instances.
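
The first remedy, using a richer kernel, can be checked on the three-point example of this section. The sketch below is an illustration under assumptions: it uses a Gaussian RBF kernel with bandwidth `sigma = 1.0` (a hypothetical choice) and takes the kernelised unhinged scorer to be the label-weighted average of kernel evaluations against the training instances, scaled by 1/λ.

```python
import math

points = [(1.0, 2.0), (1.0, -4.0), (-1.0, 1.0)]
labels = [1, 1, -1]


def rbf(u, v, sigma=1.0):
    """Gaussian RBF kernel exp(-||u - v||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def unhinged_kernel_score(x, points, labels, kernel, lam=1.0):
    """Label-weighted average of kernel evaluations against the training instances."""
    n = len(points)
    return sum(y * kernel(p, x) for p, y in zip(points, labels)) / (n * lam)


kernel_scores = [unhinged_kernel_score(x, points, labels, rbf) for x in points]
kernel_predictions = [1 if s > 0.0 else -1 for s in kernel_scores]
```

With this bandwidth each instance is dominated by its own kernel term, so all three predictions match the labels, in contrast to the linear case.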
Additional Experiments for “Learning with Symmetric
Label Noise: The Importance of Being Unhinged”

H Additional experimental results 

Table 4 reports the 0-1 error for a range of losses on the Long and Servedio [2010] dataset. TanBoost refers
to the loss of Masnadi-Shirazi et al. [2010]. As before, we find that the unhinged loss generally finds a good
classifier. Observe that the relatively poor performance of the square and TanBoost losses can be attributed to
the findings of Appendices B and F.

We next report the 0-1 error and one minus the AUC for a range of datasets. We begin with a dataset
of Mease and Wyner [2008], where $\mathcal{X} = [0, 1]^{20}$, and $M$ is the uniform distribution. Further, we have
$\eta \colon x \mapsto [\![ \langle w^{*}, x \rangle > 2.5 ]\!]$ for $w^{*} = [\mathbf{1}_5 \; \mathbf{0}_{15}]$, i.e. there is a sparse separating hyperplane. Table 5 reports
the results on this dataset injected with various levels of symmetric noise. On this dataset, the $t$-logistic loss
generally performs the best.

Finally, we report the 0-1 error and one minus the AUC on some UCI datasets in Tables 6-7. Table 3
summarises statistics of the UCI data. Several datasets are imbalanced, meaning that 0-1 error is not the ideal
measure of performance (as it can be made small with a trivial majority classifier). The AUC is thus arguably
a better indication of performance for these datasets. We generally find that at high noise rates (40%), the
AUC of the unhinged loss is superior to that of other losses.


Dataset      N       D     P(Y = 1)
Iris         150     4     0.3333
Ionosphere   351     34    0.3590
Housing      506     13    0.0692
Car          1,728   8     0.0376
USPS 0v7     2,200   256   0.5000
Splice       3,190   61    0.2404
Spambase     4,601   57    0.3940


Table 3: Summary of UCI datasets. Here, N denotes the total number of samples, and D the dimensionality 
of the feature space. 
            Hinge         Logistic      Square        t-logistic    TanBoost      Unhinged
ρ = 0       0.00 ± 0.00   0.00 ± 0.00   0.25 ± 0.00   0.00 ± 0.00   0.25 ± 0.00   0.00 ± 0.00
ρ = 0.1     0.15 ± 0.27   0.24 ± 0.05   0.25 ± 0.00   0.00 ± 0.00   0.25 ± 0.00   0.00 ± 0.00
ρ = 0.2     0.21 ± 0.30   0.25 ± 0.00   0.25 ± 0.00   0.00 ± 0.00   0.25 ± 0.00   0.00 ± 0.00
ρ = 0.3     0.38 ± 0.37   0.25 ± 0.03   0.25 ± 0.02   0.22 ± 0.08   0.25 ± 0.03   0.00 ± 0.00
ρ = 0.4     0.42 ± 0.36   0.22 ± 0.08   0.22 ± 0.08   0.22 ± 0.08   0.22 ± 0.08   0.00 ± 0.00
ρ = 0.49    0.46 ± 0.38   0.39 ± 0.23   0.39 ± 0.23   0.39 ± 0.23   0.39 ± 0.23   0.34 ± 0.48


Table 4: Results on Long and Servedio [2010] dataset. Reported is the mean and standard deviation of the
0-1 error over 125 trials. Grayed cells denote the best performer at that noise rate.




            Hinge         Logistic      Square        t-logistic    TanBoost      Unhinged
ρ = 0       0.02 ± 0.00   0.01 ± 0.00   0.03 ± 0.00   0.01 ± 0.00   0.02 ± 0.00   0.05 ± 0.00
ρ = 0.1     0.13 ± 0.01   0.05 ± 0.01   0.06 ± 0.01   0.03 ± 0.01   0.05 ± 0.01   0.06 ± 0.01
ρ = 0.2     0.14 ± 0.01   0.09 ± 0.02   0.09 ± 0.02   0.06 ± 0.02   0.08 ± 0.02   0.08 ± 0.02
ρ = 0.3     0.15 ± 0.01   0.13 ± 0.03   0.13 ± 0.03   0.12 ± 0.03   0.12 ± 0.03   0.12 ± 0.02
ρ = 0.4     0.17 ± 0.05   0.24 ± 0.08   0.24 ± 0.08   0.23 ± 0.07   0.23 ± 0.08   0.23 ± 0.08
ρ = 0.49    0.47 ± 0.24   0.46 ± 0.11   0.47 ± 0.11   0.48 ± 0.10   0.47 ± 0.12   0.48 ± 0.12


(a) 0-1 Error. 




            Hinge         Logistic      Square        t-logistic    TanBoost      Unhinged
ρ = 0       0.00 ± 0.00   0.00 ± 0.00   0.01 ± 0.00   0.00 ± 0.00   0.00 ± 0.00   0.01 ± 0.00
ρ = 0.1     0.25 ± 0.10   0.02 ± 0.01   0.02 ± 0.01   0.00 ± 0.00   0.02 ± 0.01   0.02 ± 0.01
ρ = 0.2     0.34 ± 0.10   0.05 ± 0.02   0.05 ± 0.02   0.02 ± 0.01   0.04 ± 0.02   0.05 ± 0.02
ρ = 0.3     0.41 ± 0.11   0.11 ± 0.04   0.11 ± 0.04   0.09 ± 0.04   0.11 ± 0.04   0.10 ± 0.04
ρ = 0.4     0.44 ± 0.12   0.24 ± 0.08   0.24 ± 0.08   0.24 ± 0.08   0.24 ± 0.08   0.23 ± 0.08
ρ = 0.49    0.50 ± 0.13   0.47 ± 0.11   0.47 ± 0.11   0.47 ± 0.11   0.47 ± 0.11   0.46 ± 0.11


(b) 1 - AUC. 


Table 5: Results on mease dataset. Reported is the mean and standard deviation of performance over 125
trials. Grayed cells denote the best performer at that noise rate.
            Hinge         Logistic      Square        t-logistic    TanBoost      Unhinged
ρ = 0       0.00 ± 0.00   0.00 ± 0.00   0.00 ± 0.00   0.00 ± 0.00   0.00 ± 0.00   0.00 ± 0.00
ρ = 0.1     0.01 ± 0.03   0.01 ± 0.01   0.01 ± 0.02   0.01 ± 0.03   0.01 ± 0.02   0.00 ± 0.00
ρ = 0.2     0.06 ± 0.12   0.02 ± 0.05   0.03 ± 0.04   0.04 ± 0.05   0.03 ± 0.05   0.00 ± 0.01
ρ = 0.3     0.17 ± 0.20   0.09 ± 0.10   0.08 ± 0.09   0.09 ± 0.11   0.09 ± 0.10   0.02 ± 0.07
ρ = 0.4     0.35 ± 0.24   0.24 ± 0.17   0.24 ± 0.17   0.24 ± 0.16   0.24 ± 0.17   0.13 ± 0.22
ρ = 0.49    0.60 ± 0.20   0.49 ± 0.20   0.49 ± 0.19   0.49 ± 0.20   0.49 ± 0.19   0.45 ± 0.33


(a) 0-1 Error. 




            Hinge         Logistic      Square        t-logistic    TanBoost      Unhinged
ρ = 0       0.00 ± 0.00   0.00 ± 0.00   0.00 ± 0.00   0.00 ± 0.00   0.00 ± 0.00   0.00 ± 0.00
ρ = 0.1     0.00 ± 0.00   0.00 ± 0.00   0.00 ± 0.00   0.00 ± 0.00   0.00 ± 0.00   0.00 ± 0.00
ρ = 0.2     0.03 ± 0.11   0.00 ± 0.01   0.00 ± 0.00   0.00 ± 0.01   0.00 ± 0.01   0.00 ± 0.00
ρ = 0.3     0.14 ± 0.26   0.02 ± 0.06   0.02 ± 0.05   0.02 ± 0.06   0.02 ± 0.05   0.01 ± 0.06
ρ = 0.4     0.36 ± 0.38   0.13 ± 0.18   0.13 ± 0.18   0.14 ± 0.18   0.13 ± 0.18   0.09 ± 0.27
ρ = 0.49    0.72 ± 0.34   0.47 ± 0.31   0.48 ± 0.30   0.48 ± 0.30   0.48 ± 0.30   0.45 ± 0.48


(b) 1 - AUC. 


Table 6: Results on iris dataset. Reported is the mean and standard deviation of performance over 125 
trials. Grayed cells denote the best performer at that noise rate. 




            Hinge         Logistic      Square        t-logistic    TanBoost      Unhinged
ρ = 0       0.11 ± 0.00   0.13 ± 0.00   0.17 ± 0.00   0.24 ± 0.00   0.17 ± 0.00   0.20 ± 0.00
ρ = 0.1     0.17 ± 0.04   0.18 ± 0.04   0.16 ± 0.03   0.19 ± 0.05   0.17 ± 0.04   0.19 ± 0.02
ρ = 0.2     0.20 ± 0.05   0.19 ± 0.05   0.18 ± 0.04   0.21 ± 0.06   0.18 ± 0.04   0.19 ± 0.02
ρ = 0.3     0.23 ± 0.06   0.22 ± 0.05   0.22 ± 0.05   0.24 ± 0.06   0.22 ± 0.05   0.21 ± 0.03
ρ = 0.4     0.31 ± 0.11   0.31 ± 0.10   0.29 ± 0.09   0.32 ± 0.09   0.30 ± 0.10   0.27 ± 0.12
ρ = 0.49    0.48 ± 0.16   0.47 ± 0.16   0.47 ± 0.16   0.47 ± 0.14   0.45 ± 0.15   0.46 ± 0.22


(a) 0-1 Error. 




          Hinge        Logistic     Square       t-logistic   TanBoost     Unhinged
ρ = 0     0.12 ± 0.00  0.13 ± 0.00  0.07 ± 0.00  0.20 ± 0.00  0.07 ± 0.00  0.21 ± 0.00
ρ = 0.1   0.18 ± 0.07  0.18 ± 0.07  0.12 ± 0.04  0.22 ± 0.07  0.13 ± 0.05  0.21 ± 0.00
ρ = 0.2   0.23 ± 0.09  0.22 ± 0.09  0.18 ± 0.07  0.25 ± 0.08  0.19 ± 0.08  0.21 ± 0.01
ρ = 0.3   0.31 ± 0.11  0.29 ± 0.09  0.26 ± 0.09  0.30 ± 0.09  0.27 ± 0.09  0.21 ± 0.01
ρ = 0.4   0.40 ± 0.11  0.40 ± 0.10  0.38 ± 0.10  0.40 ± 0.10  0.38 ± 0.10  0.25 ± 0.12
ρ = 0.49  0.49 ± 0.12  0.50 ± 0.10  0.50 ± 0.10  0.50 ± 0.10  0.50 ± 0.10  0.46 ± 0.25

(b) 1 - AUC.


Table 7: Results on ionosphere dataset. Reported is the mean and standard deviation of performance over 
125 trials. Grayed cells denote the best performer at that noise rate. 



          Hinge        Logistic     Square       t-logistic   TanBoost     Unhinged
ρ = 0     0.05 ± 0.00  0.05 ± 0.00  0.07 ± 0.00  0.05 ± 0.00  0.07 ± 0.00  0.05 ± 0.00
ρ = 0.1   0.06 ± 0.01  0.06 ± 0.02  0.07 ± 0.02  0.07 ± 0.02  0.07 ± 0.02  0.05 ± 0.00
ρ = 0.2   0.06 ± 0.01  0.07 ± 0.03  0.07 ± 0.02  0.08 ± 0.03  0.07 ± 0.02  0.05 ± 0.00
ρ = 0.3   0.08 ± 0.04  0.10 ± 0.06  0.11 ± 0.06  0.11 ± 0.05  0.11 ± 0.06  0.05 ± 0.01
ρ = 0.4   0.14 ± 0.10  0.21 ± 0.12  0.22 ± 0.12  0.24 ± 0.13  0.22 ± 0.13  0.09 ± 0.10
ρ = 0.49  0.45 ± 0.26  0.49 ± 0.16  0.50 ± 0.16  0.49 ± 0.16  0.51 ± 0.17  0.46 ± 0.30

(a) 0-1 Error.




          Hinge        Logistic     Square       t-logistic   TanBoost     Unhinged
ρ = 0     0.25 ± 0.00  0.15 ± 0.00  0.17 ± 0.00  0.25 ± 0.00  0.17 ± 0.00  0.69 ± 0.00
ρ = 0.1   0.38 ± 0.12  0.27 ± 0.07  0.27 ± 0.07  0.30 ± 0.09  0.27 ± 0.07  0.69 ± 0.00
ρ = 0.2   0.41 ± 0.13  0.35 ± 0.10  0.35 ± 0.10  0.35 ± 0.10  0.35 ± 0.10  0.68 ± 0.00
ρ = 0.3   0.44 ± 0.12  0.40 ± 0.11  0.40 ± 0.11  0.40 ± 0.11  0.40 ± 0.11  0.69 ± 0.01
ρ = 0.4   0.43 ± 0.12  0.45 ± 0.12  0.45 ± 0.12  0.45 ± 0.12  0.45 ± 0.12  0.68 ± 0.02
ρ = 0.49  0.45 ± 0.13  0.49 ± 0.13  0.49 ± 0.13  0.49 ± 0.13  0.49 ± 0.13  0.57 ± 0.16

(b) 1 - AUC.


Table 8: Results on housing dataset. Reported is the mean and standard deviation of performance over 125 
trials. Grayed cells denote the best performer at that noise rate. 




          Hinge        Logistic     Square       t-logistic   TanBoost     Unhinged
ρ = 0     0.01 ± 0.00  0.02 ± 0.00  0.03 ± 0.00  0.03 ± 0.00  0.02 ± 0.00  0.03 ± 0.00
ρ = 0.1   0.05 ± 0.00  0.04 ± 0.01  0.04 ± 0.01  0.02 ± 0.01  0.04 ± 0.01  0.04 ± 0.01
ρ = 0.2   0.05 ± 0.00  0.05 ± 0.01  0.05 ± 0.01  0.04 ± 0.01  0.05 ± 0.01  0.05 ± 0.01
ρ = 0.3   0.05 ± 0.01  0.06 ± 0.01  0.06 ± 0.01  0.06 ± 0.02  0.06 ± 0.01  0.06 ± 0.01
ρ = 0.4   0.06 ± 0.02  0.11 ± 0.06  0.11 ± 0.06  0.11 ± 0.06  0.11 ± 0.06  0.10 ± 0.05
ρ = 0.49  0.33 ± 0.27  0.46 ± 0.16  0.46 ± 0.16  0.47 ± 0.16  0.47 ± 0.16  0.46 ± 0.16

(a) 0-1 Error.



          Hinge        Logistic     Square       t-logistic   TanBoost     Unhinged
ρ = 0     0.00 ± 0.00  0.00 ± 0.00  0.01 ± 0.00  0.00 ± 0.00  0.01 ± 0.00  0.02 ± 0.00
ρ = 0.1   0.34 ± 0.18  0.03 ± 0.02  0.03 ± 0.02  0.00 ± 0.00  0.03 ± 0.02  0.04 ± 0.02
ρ = 0.2   0.40 ± 0.17  0.07 ± 0.05  0.08 ± 0.05  0.04 ± 0.04  0.07 ± 0.05  0.08 ± 0.05
ρ = 0.3   0.43 ± 0.17  0.17 ± 0.10  0.17 ± 0.10  0.14 ± 0.10  0.16 ± 0.10  0.16 ± 0.10
ρ = 0.4   0.44 ± 0.18  0.30 ± 0.16  0.30 ± 0.16  0.30 ± 0.16  0.30 ± 0.16  0.30 ± 0.16
ρ = 0.49  0.51 ± 0.19  0.46 ± 0.17  0.46 ± 0.17  0.46 ± 0.17  0.46 ± 0.17  0.46 ± 0.18

(b) 1 - AUC.


Table 9: Results on car dataset. Reported is the mean and standard deviation of performance over 125 trials. 
Grayed cells denote the best performer at that noise rate. 

          Hinge        Logistic     Square       t-logistic   TanBoost     Unhinged
ρ = 0     0.00 ± 0.00  0.00 ± 0.00  0.00 ± 0.00  0.00 ± 0.00  0.00 ± 0.00  0.00 ± 0.00
ρ = 0.1   0.10 ± 0.08  0.05 ± 0.01  0.01 ± 0.01  0.11 ± 0.02  0.02 ± 0.01  0.00 ± 0.00
ρ = 0.2   0.19 ± 0.11  0.09 ± 0.02  0.05 ± 0.02  0.15 ± 0.02  0.06 ± 0.02  0.00 ± 0.00
ρ = 0.3   0.31 ± 0.13  0.17 ± 0.03  0.14 ± 0.02  0.22 ± 0.03  0.16 ± 0.03  0.01 ± 0.00
ρ = 0.4   0.39 ± 0.13  0.31 ± 0.04  0.30 ± 0.04  0.33 ± 0.04  0.31 ± 0.04  0.02 ± 0.02
ρ = 0.49  0.50 ± 0.16  0.48 ± 0.04  0.47 ± 0.04  0.48 ± 0.04  0.48 ± 0.04  0.34 ± 0.21

(a) 0-1 Error.




          Hinge        Logistic     Square       t-logistic   TanBoost     Unhinged
ρ = 0     0.00 ± 0.00  0.00 ± 0.00  0.00 ± 0.00  0.00 ± 0.00  0.00 ± 0.00  0.00 ± 0.00
ρ = 0.1   0.05 ± 0.06  0.01 ± 0.00  0.00 ± 0.00  0.05 ± 0.01  0.00 ± 0.00  0.00 ± 0.00
ρ = 0.2   0.12 ± 0.11  0.03 ± 0.01  0.01 ± 0.00  0.07 ± 0.01  0.02 ± 0.01  0.00 ± 0.00
ρ = 0.3   0.26 ± 0.18  0.10 ± 0.02  0.07 ± 0.02  0.14 ± 0.03  0.08 ± 0.02  0.00 ± 0.00
ρ = 0.4   0.37 ± 0.19  0.25 ± 0.04  0.24 ± 0.04  0.27 ± 0.04  0.24 ± 0.04  0.00 ± 0.00
ρ = 0.49  0.51 ± 0.23  0.47 ± 0.05  0.46 ± 0.05  0.47 ± 0.05  0.47 ± 0.05  0.25 ± 0.29

(b) 1 - AUC.


Table 10: Results on usps_0_vs_7 dataset. Reported is the mean and standard deviation of performance 
over 125 trials. Grayed cells denote the best performer at that noise rate. 




          Hinge        Logistic     Square       t-logistic   TanBoost     Unhinged
ρ = 0     0.05 ± 0.00  0.04 ± 0.00  0.02 ± 0.00  0.04 ± 0.00  0.02 ± 0.00  0.19 ± 0.00
ρ = 0.1   0.15 ± 0.03  0.05 ± 0.01  0.04 ± 0.01  0.24 ± 0.00  0.04 ± 0.01  0.19 ± 0.01
ρ = 0.2   0.21 ± 0.03  0.08 ± 0.01  0.07 ± 0.01  0.24 ± 0.00  0.07 ± 0.01  0.19 ± 0.01
ρ = 0.3   0.25 ± 0.03  0.14 ± 0.02  0.14 ± 0.02  0.24 ± 0.00  0.14 ± 0.02  0.19 ± 0.03
ρ = 0.4   0.31 ± 0.05  0.28 ± 0.05  0.28 ± 0.04  0.24 ± 0.00  0.28 ± 0.04  0.22 ± 0.05
ρ = 0.49  0.48 ± 0.09  0.47 ± 0.06  0.48 ± 0.05  0.40 ± 0.24  0.48 ± 0.05  0.45 ± 0.08

(a) 0-1 Error.



          Hinge        Logistic     Square       t-logistic   TanBoost     Unhinged
ρ = 0     0.01 ± 0.00  0.01 ± 0.00  0.00 ± 0.00  0.01 ± 0.00  0.00 ± 0.00  0.09 ± 0.00
ρ = 0.1   0.10 ± 0.03  0.01 ± 0.00  0.01 ± 0.00  0.03 ± 0.01  0.01 ± 0.00  0.09 ± 0.01
ρ = 0.2   0.20 ± 0.05  0.03 ± 0.01  0.02 ± 0.01  0.04 ± 0.01  0.02 ± 0.01  0.10 ± 0.02
ρ = 0.3   0.30 ± 0.06  0.08 ± 0.02  0.08 ± 0.02  0.09 ± 0.02  0.07 ± 0.02  0.11 ± 0.03
ρ = 0.4   0.40 ± 0.07  0.22 ± 0.04  0.22 ± 0.04  0.23 ± 0.04  0.22 ± 0.04  0.16 ± 0.07
ρ = 0.49  0.49 ± 0.08  0.46 ± 0.05  0.46 ± 0.05  0.46 ± 0.05  0.45 ± 0.05  0.42 ± 0.15

(b) 1 - AUC.


Table 11: Results on splice dataset. Reported is the mean and standard deviation of performance over 125 
trials. Grayed cells denote the best performer at that noise rate. 

          Hinge        Logistic     Square       t-logistic   TanBoost     Unhinged
ρ = 0     0.16 ± 0.01  0.08 ± 0.00  0.10 ± 0.00  0.24 ± 0.00  0.09 ± 0.00  0.15 ± 0.00
ρ = 0.1   0.14 ± 0.03  0.10 ± 0.02  0.10 ± 0.01  0.13 ± 0.06  0.10 ± 0.01  0.14 ± 0.01
ρ = 0.2   0.17 ± 0.03  0.11 ± 0.02  0.11 ± 0.01  0.13 ± 0.05  0.11 ± 0.01  0.14 ± 0.01
ρ = 0.3   0.23 ± 0.05  0.13 ± 0.02  0.12 ± 0.01  0.14 ± 0.04  0.13 ± 0.02  0.15 ± 0.01
ρ = 0.4   0.33 ± 0.07  0.20 ± 0.04  0.19 ± 0.03  0.21 ± 0.04  0.19 ± 0.03  0.17 ± 0.03
ρ = 0.49  0.49 ± 0.10  0.45 ± 0.07  0.44 ± 0.07  0.45 ± 0.07  0.45 ± 0.07  0.43 ± 0.12

(a) 0-1 Error.




          Hinge        Logistic     Square       t-logistic   TanBoost     Unhinged
ρ = 0     0.03 ± 0.00  0.02 ± 0.00  0.05 ± 0.00  0.02 ± 0.00  0.04 ± 0.00  0.07 ± 0.00
ρ = 0.1   0.06 ± 0.01  0.04 ± 0.00  0.05 ± 0.00  0.03 ± 0.00  0.04 ± 0.00  0.07 ± 0.00
ρ = 0.2   0.10 ± 0.03  0.05 ± 0.00  0.05 ± 0.00  0.04 ± 0.00  0.05 ± 0.00  0.07 ± 0.00
ρ = 0.3   0.17 ± 0.06  0.06 ± 0.01  0.06 ± 0.01  0.06 ± 0.01  0.06 ± 0.01  0.07 ± 0.01
ρ = 0.4   0.32 ± 0.12  0.12 ± 0.02  0.12 ± 0.02  0.12 ± 0.02  0.12 ± 0.02  0.09 ± 0.02
ρ = 0.49  0.49 ± 0.14  0.43 ± 0.08  0.43 ± 0.08  0.43 ± 0.07  0.43 ± 0.08  0.39 ± 0.19

(b) 1 - AUC.


Table 12: Results on spambase dataset. Reported is the mean and standard deviation of performance over 
125 trials. Grayed cells denote the best performer at that noise rate. 
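
The quantities reported in these tables (mean and standard deviation of 0-1 error and 1 - AUC over repeated trials at a given noise rate ρ) can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the experimental code behind the tables: the helper names, the toy data, and the stand-in learner (a linear scorer whose weight vector is the average of y·x over the noisy training set) are all hypothetical, and the sketch assumes that each trial redraws only the label noise and is evaluated against clean labels.

```python
import random
import statistics


def flip_labels(labels, rho, rng):
    """Flip each +/-1 label independently with probability rho."""
    return [-y if rng.random() < rho else y for y in labels]


def zero_one_error(scores, labels):
    """Fraction of points whose score sign disagrees with the label.

    A score of exactly zero is counted as an error.
    """
    wrong = sum(1 for s, y in zip(scores, labels) if s * y <= 0)
    return wrong / len(labels)


def one_minus_auc(scores, labels):
    """1 - AUC by comparing every (positive, negative) pair; ties get half credit."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    correct = sum(
        1.0 if p > n else (0.5 if p == n else 0.0) for p in pos for n in neg
    )
    return 1.0 - correct / (len(pos) * len(neg))


def fit_mean_scorer(xs, ys):
    """Hypothetical stand-in learner: weights are the average of y * x."""
    dim = len(xs[0])
    return [sum(y * x[j] for x, y in zip(xs, ys)) / len(xs) for j in range(dim)]


def score(w, xs):
    """Linear scores <w, x> for each instance."""
    return [sum(wj * xj for wj, xj in zip(w, x)) for x in xs]


def run_trials(train_x, train_y, test_x, test_y, rho, n_trials=125, seed=0):
    """Train on noisy labels, evaluate on clean labels, summarise over trials."""
    rng = random.Random(seed)
    errs, aucs = [], []
    for _ in range(n_trials):
        noisy_y = flip_labels(train_y, rho, rng)
        w = fit_mean_scorer(train_x, noisy_y)
        s = score(w, test_x)
        errs.append(zero_one_error(s, test_y))
        aucs.append(one_minus_auc(s, test_y))
    return (
        (statistics.mean(errs), statistics.pstdev(errs)),
        (statistics.mean(aucs), statistics.pstdev(aucs)),
    )


# Toy, linearly separable data (illustrative only).
TOY_X = [(1.0, 0.0), (2.0, 1.0), (-1.0, 0.0), (-2.0, -1.0)]
TOY_Y = [1, 1, -1, -1]

if __name__ == "__main__":
    (err_mean, err_std), (auc_mean, auc_std) = run_trials(
        TOY_X, TOY_Y, TOY_X, TOY_Y, rho=0.2
    )
    print(f"0-1 error: {err_mean:.2f} ± {err_std:.2f}")
    print(f"1 - AUC:   {auc_mean:.2f} ± {auc_std:.2f}")
```

Each table row corresponds to one call of this kind at a fixed ρ, and each column to a different learner substituted for the stand-in scorer.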